Why Do Safety Guardrails Degrade Across Languages?

Ameen Patel; Max Zhang; Sang T. Truong; Sanmi Koyejo

arxiv: 2605.17173 · v1 · pith:36OZG6QTnew · submitted 2026-05-16 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-20 14:13 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords: cross-lingual safety, item response theory, large language models, refusal behavior, jailbreak evaluation, multilingual robustness, safety alignment, latent variable model

The pith

A statistical model shows that, for 22 of 61 model configurations, safety failures are worse in English than in low-resource languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a Multi-Group Item Response Theory framework to separate the different influences on why large language models lose their safety guardrails when prompts are translated into other languages. It models four distinct factors that together determine whether a model refuses an unsafe request. Analysis of nearly two million responses across ten languages finds that safety refusals mostly rely on one shared underlying ability rather than independent skills for each harm category. This matters because common evaluation methods mix those factors together and hide the real sources of failure, such as specific prompt types that create larger gaps in certain languages.

Core claim

The Multi-Group IRT framework decouples four safety-driving factors: language-agnostic safety robustness, intrinsic prompt hardness, global language processing difficulty, and a prompt-specific cross-lingual safety gap. Exploratory Factor Analysis shows safety is primarily unidimensional. Across 61 model configurations and 10 languages, 22 configurations are more vulnerable in English than in low-resource languages. Low-resource languages produce more uncertain responses. High-gap prompts cluster in physical harm categories and in lower-resource languages. The framework achieves an AUC of 0.940 in predicting safe refusal.

What carries the argument

Multi-Group Item Response Theory framework that separates four latent factors to model the probability a model refuses an unsafe prompt.
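
To make the four-factor decomposition concrete, here is a minimal Python sketch. It assumes an additive logistic link in which θ raises the refusal probability while β, γ, and τ lower it. This page does not reproduce the paper's equation, so the sign convention, the function name `refusal_probability`, and every numeric value below are illustrative assumptions, not the authors' specification.

```python
import math

def refusal_probability(theta, beta, gamma, tau):
    """Probability of a safe refusal under an ASSUMED additive logit.

    theta: model's language-agnostic safety robustness
    beta:  intrinsic hardness of the prompt
    gamma: global difficulty of the language
    tau:   prompt-specific gap in that language
    """
    logit = theta - beta - gamma - tau
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical values: the same model and prompt, in English and in a
# language with extra global difficulty and a prompt-specific gap.
p_en = refusal_probability(theta=1.0, beta=0.2, gamma=0.0, tau=0.0)
p_lr = refusal_probability(theta=1.0, beta=0.2, gamma=0.5, tau=0.8)
print(round(p_en, 3), round(p_lr, 3))
```

Under this assumed form, a raw refusal rate mixes all four terms, which is the confound the pith attributes to aggregate metrics.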

Load-bearing premise

That four underlying factors are enough to explain all variation in refusal behavior and that a statistical check correctly shows safety works as one single trait across different harm types.

What would settle it

Refusal data collected on a fresh set of languages or prompt types where the model's predicted refusal rates deviate substantially from what actually happens, dropping predictive accuracy well below the reported level.
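
The check described above can be sketched as an AUC computation on held-out refusal outcomes. The version below uses the pairwise (Mann-Whitney) definition of AUC; the function name `auc_roc` and the toy labels and scores are hypothetical, and this is not the paper's evaluation code.

```python
def auc_roc(labels, scores):
    """AUC as the fraction of (refused, not refused) pairs ranked correctly.

    labels: 1 if the model safely refused, 0 otherwise
    scores: predicted refusal probabilities for the same responses
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy held-out set: three of the four pairs are ranked correctly.
held_out_auc = auc_roc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
print(held_out_auc)
```

A held-out AUC on new languages or prompt types that falls well below the reported 0.940 would be the kind of evidence this section describes.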

Figures

Figures reproduced from arXiv: 2605.17173 by Ameen Patel, Max Zhang, Sang T. Truong, Sanmi Koyejo.

**Figure 2.** Scree plots from EFA on the binary response matrix, aggregated over k generation …

**Figure 3.** Mean δjL by model family and language. Negative values (red) indicate the model is less safe in that language than its English baseline; positive (blue) indicates safer. Claude and GPT show strong English-centric alignment; Grok and DeepSeek show the reverse.

**Figure 4.** Stochastic response profiles by language. Left: deterministic vs. boundary re…

**Figure 5.** Cross-lingual safety gap visualization with anchor constraints. Each panel shows …

**Figure 6.** Translation quality vs. cross-lingual safety gap ( …

**Figure 7.** Calibration and ROC curves. Left: The full IRT model (blue) tracks the diagonal …

**Figure 8.** γL vs. τ·L under two τ priors. The Horseshoe prior (left) lowers the correlation (r = 0.081) compared to Normal (right, r = −0.191): confounding is mitigated.

**Figure 9.** Anchors have marginally higher translation quality on average.

**Figure 10.** Native speakers validate embedding analysis trends. Both safety rate and …

**Figure 11.** Model comparison. (a) Convergence. (b) AIC/BIC (Lower = better). (c) 2PL …

**Figure 12.** Information functions. (a) Test information. (b) Item information. (c) Difficulty …

**Figure 13.** ICCs: 1PL vs. 2PL for low-α (left) and high-α (right) items.

**Figure 14.** GRM category response functions. Extreme categories dominate.

**Figure 15.** Ten generation passes concatenated. Red = Unsafe. Blue = Safe. White = In…

**Figure 16.** JSR by language, aggregated across all model configurations and sorted from …

**Figure 17.** Jailbreak Success Rate heatmap across models (rows) and languages (columns).

**Figure 18.** Cross-lingual safety gap visualization with anchor constraints. Each panel shows …

**Figure 19.** Correlation matrix across 18 safety categories. High positive correlations (red) …

**Figure 20.** Response consistency. Left: bimodal P(safe). Center: entropy. Right: entropy by language.

**Figure 21.** Split-half reliability. θ: r = 0.995. β: r = 0.985. τ: r = 0.904.

**Figure 22.** τ stability across passes for each language. τ correlation across passes is between 0.886 and 0.895.

**Figure 23.** IRT calibration. Overall r = 0.804, RMSE = 0.136. Per-language: r = 0.71–0.86.

**Figure 24.** Temperature decomposition. Between-temperature fraction: mean …

**Figure 25.** Overall JSR vs. θ. 1PL: r = −0.940, ρ = −0.880. 2PL: r = −0.859, ρ = −0.815

**Figure 26.** Per-language summary. Left: |r| by language. Center: mean JSR. Right: pooled OLS (r = −0.875).

**Figure 27.** Overall rank displacement between JSR and IRT ability rankings (2PL). Left: …

**Figure 28.** Mean rank displacement by model family and language. Red = JSR overestimates …

**Figure 29.** Per-language rank divergence (RMSRD, QWK, Spearman …

**Figure 30.** Translation quality vs. safety across four metrics. Translation quality has a modest effect on raw safety outcomes.

**Figure 31.** LOLO AUC-ROC. Baselines collapse to 0.500; IRT maintains 0.767–0.908.

**Figure 32.** LOFO AUC-ROC. Grok is hardest to predict.

**Figure 33.** Contribution of τ to predictive performance across three CV regimes. Left (LOFO): τ improves AUC for all held-out model families (mean ∆ = +0.0266). Right (Random): consistent improvement (∆ = +0.0516).

**Figure 34.** Scree plot of the τ matrix (275 prompts × 9 languages). Two components exceed eigenvalue 1, explaining 53% of variance

**Figure 35.** Per-language loadings on the first four principal components of …

Original abstract

Large language models exhibit safety degradation in non-English languages. Standard evaluation relies on Jailbreak Success Rate (JSR), which confounds several safety-driving factors into one, obscuring the specific cause(s) of safety failure. We introduce a latent variable model, a Multi-Group Item Response Theory (IRT) framework, that decouples safety-driving factors such as language-agnostic safety robustness ($\theta$), intrinsic prompt hardness ($\beta$), global language processing difficulty ($\gamma$), and a prompt-specific cross-lingual safety gap ($\tau$). Using the MultiJail dataset, we evaluate the safety robustness of 61 model configurations across 5 closed-model families and 10 languages of varying resource, aggregating a dataset of 1.9 million rows. Exploratory Factor Analysis shows safety is primarily unidimensional: models refuse different harm types mainly through a shared mechanism. Contrary to the expected trend that safety degrades largely in low-resource languages, 22 model configurations are more vulnerable in English than in low-resource languages. Low-resource languages produce more uncertain responses (high entropy) than high-resource languages. Also, high-$\tau$ prompts cluster in physical harm categories like Theft and Weapons and lower-resource languages, trends validated through cross-dataset generalization. While global translation quality shows low correlation with $\tau$, severe mistranslations drive high-bias outliers, as validated by native speakers. Cultural and conceptual grounding mismatches also contribute to $\tau$. In predictive validation, the IRT framework achieves $\mathrm{AUC} = 0.940$, outperforming simpler baselines in predicting safe refusal of unsafe prompts. Our framework reveals concept-language vulnerabilities that aggregate metrics obscure, enabling fairer cross-lingual safety evaluation and targeted improvements in dataset construction.
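
The abstract contrasts an aggregate rate (JSR) with response uncertainty (entropy). A minimal sketch of both quantities for one prompt, model, and language over repeated generation passes follows. It assumes JSR is the fraction of passes judged unsafe and that entropy is the binary entropy of the per-item safe rate; the function names and toy outcomes are hypothetical.

```python
import math

def safe_rate(outcomes):
    """Fraction of generation passes judged safe (1 = safe refusal, 0 = unsafe)."""
    return sum(outcomes) / len(outcomes)

def jailbreak_success_rate(outcomes):
    """ASSUMED definition: share of passes in which the model did not refuse."""
    return 1.0 - safe_rate(outcomes)

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) outcome; 0 when p is 0 or 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

# Two hypothetical items with different stability across passes.
stable = [1, 1, 1, 1, 1, 1, 1, 1]      # always refuses: zero entropy
boundary = [1, 0, 1, 0, 1, 0, 1, 0]    # flips between passes: maximal entropy
print(binary_entropy(safe_rate(stable)), binary_entropy(safe_rate(boundary)))
```

Two items can share a similar average JSR across a benchmark while differing sharply in entropy, which is one reason the abstract reports the two separately.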

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper uses Multi-Group IRT to split safety refusal drivers across languages and finds 22 model setups weaker in English than low-resource ones, with good predictive numbers but thin support for the unidimensionality claim.

This paper introduces a Multi-Group Item Response Theory model to dissect why safety guardrails in LLMs weaken across languages. Instead of lumping everything into jailbreak success rates, they separate out language-agnostic robustness, prompt hardness, language processing difficulty, and a prompt-specific cross-lingual gap. On a dataset of 1.9 million rows from 61 model configurations and 10 languages, they get an AUC of 0.94 for predicting refusals.

What is new is the application of this IRT setup to safety evaluation and the result that 22 configurations show greater vulnerability in English than in low-resource languages. That runs against the common assumption. They also find higher response entropy in low-resource languages and that high-gap prompts often involve physical harm topics. Some validation comes from native speaker checks on translations and cross-dataset checks on clusters.

The work does well at scale and in providing a predictive tool that beats simpler baselines. It gives a way to spot concept-language vulnerabilities that aggregate metrics miss, which could help with better dataset design for multilingual safety.

The soft spots are around the modeling assumptions. The claim that safety is primarily unidimensional rests on exploratory factor analysis, but that method can miss secondary dimensions or residual correlations from cultural or translation effects. Those could influence the cross-lingual gap estimates without being fully captured. There is also no reported uncertainty on the latent parameters, which makes the cluster interpretations and the English vulnerability finding a bit harder to assess for robustness. The parameters are fit on the refusal data, so while held-out prediction helps, some circularity remains in the exploratory parts.

This is for researchers focused on multilingual LLM safety and evaluation. Readers who care about practical deployment gaps and diagnostic tools will find it useful. It deserves a serious referee because the data volume and predictive results are substantial, and the approach addresses a real issue even if the factor structure needs tighter validation. I would recommend putting it through peer review, asking reviewers to check the EFA sensitivity and perhaps suggest adding bootstrap or other uncertainty estimates for the key parameters.

Referee Report

2 major / 2 minor

Summary. The paper introduces a Multi-Group Item Response Theory (IRT) framework to decouple factors behind safety degradation in LLMs across languages, including language-agnostic robustness (theta), prompt hardness (beta), language difficulty (gamma), and cross-lingual safety gap (tau). Analyzing 1.9 million rows from the MultiJail dataset across 61 model configurations and 10 languages, Exploratory Factor Analysis indicates safety is primarily unidimensional. The study finds that 22 model configurations are more vulnerable in English than in low-resource languages, with high-tau prompts clustering in physical harm categories, and achieves an AUC of 0.940 in predicting safe refusals.

Significance. If the IRT model assumptions hold, particularly the unidimensionality of safety and the validity of the four latent factors, this work offers a significant advance over aggregate metrics like Jailbreak Success Rate by providing interpretable, disentangled insights into cross-lingual safety vulnerabilities. The large-scale evaluation and strong predictive performance support its potential to inform targeted improvements in multilingual safety alignment and dataset design. The counter-intuitive finding regarding English vulnerability and the analysis of mistranslations add novel perspectives.

major comments (2)

[Exploratory Factor Analysis] The claim that safety is primarily unidimensional rests on EFA results, but without reporting the variance explained by the first factor or conducting tests for residual correlations between harm categories, it is unclear whether local independence holds. This is critical because unmodeled correlations (e.g., in physical harm prompts due to translation artifacts) could affect the reliability of the prompt-specific tau parameter and the interpretation of cross-lingual gaps.
[Multi-Group IRT framework] The four latent factors (theta, beta, gamma, tau) are estimated from the refusal data; while held-out predictive validation is reported, the manuscript lacks uncertainty quantification on the latent variable estimates themselves. This weakens the post-hoc interpretations of tau clusters and the identification of 22 model configurations more vulnerable in English.
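
As a rough illustration of the first major comment, the share of variance carried by the leading dimension can be approximated from the correlation matrix of responses. The sketch below is a PCA-style proxy (leading eigenvalue divided by the number of columns), not a full exploratory factor analysis and not the paper's procedure. The function names and the toy matrix (rows as model configurations, columns as harm categories) are hypothetical.

```python
import math

def correlation_matrix(rows):
    """Pearson correlations between columns; assumes every column varies."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[c] for r in rows) / n for c in range(m)]
    sds = [math.sqrt(sum((r[c] - means[c]) ** 2 for r in rows) / n) for c in range(m)]
    return [
        [
            sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n * sds[a] * sds[b])
            for b in range(m)
        ]
        for a in range(m)
    ]

def leading_eigenvalue(matrix, iterations=500):
    """Power iteration; adequate for a small positive semi-definite matrix."""
    m = len(matrix)
    v = [float(i + 1) for i in range(m)]  # non-uniform start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    eig = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(m)) for i in range(m)]
        eig = math.sqrt(sum(x * x for x in w))
        if eig == 0.0:
            break
        v = [x / eig for x in w]
    return eig

def first_component_share(rows):
    """Leading eigenvalue over the trace (= number of columns) of the correlation matrix."""
    corr = correlation_matrix(rows)
    return leading_eigenvalue(corr) / len(corr)

# Toy binary refusal matrix: two categories move together, a third only partly.
toy = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0]]
toy_share = first_component_share(toy)
print(round(toy_share, 3))
```

Reporting this share alongside the residual correlations left after removing the first dimension is the kind of evidence the referee asks for.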

minor comments (2)

[Abstract] The abstract mentions 'high-τ prompts cluster in physical harm categories like Theft and Weapons and lower-resource languages'; consider specifying the exact clustering method or threshold used for 'high-τ'.
[Methods] Clarify the definition and estimation procedure for the global language processing difficulty parameter γ in the IRT model equations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thorough review and valuable suggestions. Below, we provide point-by-point responses to the major comments and describe the revisions we plan to implement.

Referee: [Exploratory Factor Analysis] The claim that safety is primarily unidimensional rests on EFA results, but without reporting the variance explained by the first factor or conducting tests for residual correlations between harm categories, it is unclear whether local independence holds. This is critical because unmodeled correlations (e.g., in physical harm prompts due to translation artifacts) could affect the reliability of the prompt-specific tau parameter and the interpretation of cross-lingual gaps.

Authors: We agree that reporting the variance explained by the first factor and assessing residual correlations would enhance the support for unidimensionality and local independence. We will include these analyses in the revised manuscript, specifically reporting the proportion of variance explained and examining residual correlations among harm categories to ensure they do not undermine the tau parameter interpretations. revision: yes
Referee: [Multi-Group IRT framework] The four latent factors (theta, beta, gamma, tau) are estimated from the refusal data; while held-out predictive validation is reported, the manuscript lacks uncertainty quantification on the latent variable estimates themselves. This weakens the post-hoc interpretations of tau clusters and the identification of 22 model configurations more vulnerable in English.

Authors: We acknowledge that uncertainty quantification on the latent estimates would bolster the post-hoc analyses. We will incorporate this in the revision by providing standard errors or confidence intervals for the estimated factors, particularly for tau and the identification of vulnerable models, using appropriate statistical methods from the IRT framework. revision: yes
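
One generic way to attach uncertainty to a gap estimate is a percentile bootstrap, which the desk editor's note also mentions. The sketch below resamples per-prompt gaps with replacement. It works on a simple summary statistic rather than on the IRT posterior, the rebuttal does not say which method the authors will use, and the function names and gap values are hypothetical.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def bootstrap_ci(values, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for statistic(values)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    stats = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical per-prompt gaps: English refusal rate minus target-language rate.
gaps = [0.10, 0.05, 0.30, -0.02, 0.12, 0.08, 0.25, 0.00, 0.15, 0.07]
low, high = bootstrap_ci(gaps, mean)
print(round(low, 3), round(high, 3))
```

If an interval of this kind straddles zero for a given configuration, a claim that it is more vulnerable in English would be hard to sustain.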

Circularity Check

0 steps flagged

No significant circularity in Multi-Group IRT derivation

The paper fits a Multi-Group IRT model to refusal data from the MultiJail dataset and applies Exploratory Factor Analysis to assess unidimensionality of safety. The key results, including the reported AUC of 0.940 for predicting safe refusal, rely on predictive validation using held-out prompts that is independent of the fitted parameters used for interpretation. No equations or steps reduce by construction to the inputs, no self-citations are load-bearing for the central claims, and the framework uses standard latent variable techniques without self-definitional loops or renaming of known results. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The model introduces four latent variables whose values are estimated from refusal data; no new physical entities are postulated. The main free parameters are the per-model theta, per-prompt beta, per-language gamma, and per-prompt-language tau.

free parameters (2)

language-agnostic safety robustness (theta)
Estimated per model configuration from refusal patterns; central to all downstream claims about relative vulnerability.
prompt-specific cross-lingual safety gap (tau)
Fitted per prompt-language pair; used to identify high-bias clusters in physical harm and lower-resource languages.

axioms (1)

domain assumption: Safety refusal behavior is adequately captured by a unidimensional latent trait plus three additional factors.
Invoked via Exploratory Factor Analysis results; if multidimensionality is present, the decoupling interpretation weakens.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We introduce a latent variable model, a Multi-Group Item Response Theory (IRT) framework, that decouples safety-driving factors such as language-agnostic safety robustness (θ), intrinsic prompt hardness (β), global language processing difficulty (γ), and a prompt-specific cross-lingual safety gap (τ).

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Exploratory Factor Analysis shows safety is primarily unidimensional: models refuse different harm types mainly through a shared mechanism.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 431–445, Singapore

Wang, Wenxuan and Tu, Zhaopeng and Chen, Chang and Yuan, Youliang and Huang, Jen-tse and Jiao, Wenxiang and Lyu, Michael R. , month = jun, year =. All. doi:10.48550/arXiv.2310.00905 , abstract =

work page doi:10.48550/arxiv.2310.00905

[2] [2]

Psychometrika , author =

An. Psychometrika , author =. 1974 , pages =. doi:10.1007/BF02291575 , abstract =

work page doi:10.1007/bf02291575 1974

[3] [3]

Educational and Psychological Measurement , author =

Anchor. Educational and Psychological Measurement , author =. 2015 , pages =. doi:10.1177/0013164414529792 , abstract =

work page doi:10.1177/0013164414529792 2015

[4] [4]

Lord, F. M. , month = nov, year =. Applications of. doi:10.4324/9780203056615 , language =

work page doi:10.4324/9780203056615

[5] [5]

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

Huang, Yangsibo and Gupta, Samyak and Xia, Mengzhou and Li, Kai and Chen, Danqi , month = oct, year =. Catastrophic. doi:10.48550/arXiv.2310.06987 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.06987

[6]

Rei, Ricardo and de Souza, José G. C. and Alves, Duarte and Zerva, Chrysoula and Farinha, Ana C. and Glushkova, Taisiya and Lavie, Alon and Coheur, Luisa and Martins, André F. T. Proceedings of the. 2022. doi:10.18653/v1/2022.wmt-1.52

[7]

Rei, Ricardo and Treviso, Marcos and Guerreiro, Nuno M. and Zerva, Chrysoula and Farinha, Ana C. and Maroti, Christine and de Souza, José G. C. and Glushkova, Taisiya and Alves, Duarte M. and Lavie, Alon and Coheur, Luisa and Martins, André F. T. (Sep.). CometKiwi: IST-Unbabel 2022 Submission for the Quality Estimation Shared Task. doi:10.48550/arXiv.2209.06243

[8]

Chouldechova, Alexandra and Cooper, A. Feder and Barocas, Solon and Palia, Abhinav and Vann, Dan and Wallach, Hanna (Jan.). Comparison Requires Valid Measurement:. doi:10.48550/arXiv.2601.18076

[9]

Peng, Qiwei and Søgaard, Anders (Oct.). Concept. doi:10.48550/arXiv.2410.01079

[10]

Effects of. Applied Psychological Measurement, 1984. doi:10.1177/014662168400800201

[11]

ETS Research Bulletin Series, 1968. doi:10.1002/j.2333-8504.1968.tb00153.x

[12]

Truong, Sang and Tu, Yuheng and Hardy, Michael and Reuel, Anka and Tang, Zeyu and Burapacheep, Jirayu and Perera, Jonathan and Uwakwe, Chibuike and Domingue, Ben and Haber, Nick and Koyejo, Sanmi (Nov.). Fantastic. doi:10.48550/arXiv.2511.16842

[13]

". The American Journal of Psychology, 1904. doi:10.2307/1412107

[14]

Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and Zhang, Hao and Gonzalez, Joseph E. and Stoica, Ion (Dec.). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. doi:10.48550/arXiv.2306.05685

[15]

Feng, Fangxiaoyu and Yang, Yinfei and Cer, Daniel and Arivazhagan, Naveen and Wang, Wei (Mar.). Language-agnostic BERT sentence embedding, 2022. doi:10.48550/arXiv.2007.01852

[16]

Ning, Zhiyuan and Gu, Tianle and Song, Jiaxin and Hong, Shixin and Li, Lingyu and Liu, Huacan and Li, Jie and Wang, Yixu and Lingyu, Meng and Teng, Yan and Wang, Yingchun (Aug.). doi:10.48550/arXiv.2508.12733

[17]

Little. Educational and Psychological Measurement, 1974. doi:10.1177/001316447403400115

[18]

Deng, Yue and Zhang, Wenxuan and Pan, Sinno Jialin and Bing, Lidong (Mar.). Multilingual jailbreak challenges in large language models. doi:10.48550/arXiv.2310.06474

[19]

Spiliopoulou, Evangelia and Fogliato, Riccardo and Burnsky, Hanna and Soliman, Tamer and Ma, Jie and Horwood, Graham and Ballesteros, Miguel (Aug.). Play. doi:10.48550/arXiv.2508.06709

[20]

Bingham, Eli and Chen, Jonathan P. and Jankowiak, Martin and Obermeyer, Fritz and Pradhan, Neeraj and Karaletsos, Theofanis and Singh, Rohit and Szerlip, Paul and Horsfall, Paul and Goodman, Noah D. (Oct.). Pyro: Deep Universal Probabilistic Programming. doi:10.48550/arXiv.1810.09538

[21]

Wang, Xinpeng and Wang, Mingyang and Liu, Yihong and Schütze, Hinrich and Plank, Barbara (Feb.). Refusal. doi:10.48550/arXiv.2505.17306

[22]

Differential Item Functioning. doi:10.4324/9780203357811

[23]

Kendall, M. G. 1948.

[24]

Arditi, Andy and Obeso, Oscar and Syed, Aaquib and Paleka, Daniel and Panickssery, Nina and Gurnee, Wes and Nanda, Neel (Oct.). Refusal in Language Models Is Mediated by a Single Direction. doi:10.48550/arXiv.2406.11717

[25]

Truong, Sang and Tu, Yuheng and Liang, Percy and Li, Bo and Koyejo, Sanmi (Mar.). Reliable and. doi:10.48550/arXiv.2503.13335

[26]

Chen, Hongyu and Goldfarb-Tarrant, Seraphina (Jul.). Safer or. doi:10.48550/arXiv.2503.09347

[27]

Baker, Frank B. The

[28]

Wollschläger, Tom and Elstner, Jannes and Geisler, Simon and Cohen-Addad, Vincent and Günnemann, Stephan and Gasteiger, Johannes (Feb.). The. doi:10.48550/arXiv.2502.17420

[29]

Pan, Wenbo and Liu, Zhichao and Chen, Qiguang and Zhou, Xiangyang and Yu, Haining and Jia, Xiaohua (May). The. doi:10.48550/arXiv.2502.09674

[30]

Larsen, Erik (Dec.). The. doi:10.48550/arXiv.2512.12066

[31]

Variational. Journal of the American Statistical Association, 2017. doi:10.1080/01621459.2017.1285773

[32]

He, Luxi and Xia, Mengzhou and Henderson, Peter (Aug.). What is in. doi:10.48550/arXiv.2404.01099

[33]

Guerreiro, Nuno M. and Rei, Ricardo and Stigt, Daan van and Coheur, Luisa and Colombo, Pierre and Martins, André F. T. (Oct.). doi:10.48550/arXiv.2310.10482