
arxiv: 2605.17173 · v1 · pith:36OZG6QTnew · submitted 2026-05-16 · 💻 cs.CL · cs.AI · cs.LG

Why Do Safety Guardrails Degrade Across Languages?

Pith reviewed 2026-05-20 14:13 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords cross-lingual safety · item response theory · large language models · refusal behavior · jailbreak evaluation · multilingual robustness · safety alignment · latent variable model

The pith

A statistical model shows that safety failures are not confined to low-resource languages: 22 of 61 model configurations are more vulnerable in English than in low-resource languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a Multi-Group Item Response Theory framework to separate the different influences on why large language models lose their safety guardrails when prompts are translated into other languages. It models four distinct factors that together determine whether a model refuses an unsafe request. Analysis of nearly two million responses across ten languages finds that safety refusals mostly rely on one shared underlying ability rather than independent skills for each harm category. This matters because common evaluation methods mix those factors together and hide the real sources of failure, such as specific prompt types that create larger gaps in certain languages.

Core claim

The Multi-Group IRT framework decouples four safety-driving factors: language-agnostic safety robustness (θ), intrinsic prompt hardness (β), global language processing difficulty (γ), and a prompt-specific cross-lingual safety gap (τ). Exploratory Factor Analysis shows safety is primarily unidimensional. Across 61 model configurations and 10 languages, 22 configurations are more vulnerable in English than in low-resource languages. Low-resource languages produce more uncertain (higher-entropy) responses. High-τ prompts cluster in physical harm categories such as Theft and Weapons and in lower-resource languages. In predictive validation, the framework achieves an AUC of 0.940 in predicting safe refusal.

What carries the argument

Multi-Group Item Response Theory framework that separates four latent factors to model the probability a model refuses an unsafe prompt.
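To make the four-factor decomposition concrete, here is a minimal Python sketch. It assumes an additive logistic link in which θ raises the refusal probability and β, γ, and τ lower it. That functional form, the sign convention, the function name, and the numeric values are all illustrative assumptions; this page does not reproduce the paper's model equation.

```python
import math


def refusal_probability(theta, beta, gamma, tau):
    """Probability of a safe refusal under an ASSUMED additive logistic link.

    theta: language-agnostic safety robustness of the model configuration
    beta:  intrinsic hardness of the prompt
    gamma: global processing difficulty of the language
    tau:   prompt-specific cross-lingual safety gap
    The signs below are an illustrative convention, not the paper's.
    """
    logit = theta - beta - gamma - tau
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical values: same model and prompt, English baseline vs. another language.
p_english = refusal_probability(theta=1.0, beta=0.5, gamma=0.0, tau=0.0)
p_other = refusal_probability(theta=1.0, beta=0.5, gamma=0.3, tau=0.8)
print(round(p_english, 3), round(p_other, 3))
```

Under this convention, a raw refusal rate mixes all four terms, which is the confounding the framework is meant to separate.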

Load-bearing premise

That four underlying factors are enough to explain all variation in refusal behavior and that a statistical check correctly shows safety works as one single trait across different harm types.

What would settle it

Refusal data collected on a fresh set of languages or prompt types where the model's predicted refusal rates deviate substantially from what actually happens, dropping predictive accuracy well below the reported level.

Figures

Figures reproduced from arXiv: 2605.17173 by Ameen Patel, Max Zhang, Sang T. Truong, Sanmi Koyejo.

Figure 1: Overview of the multi-group IRT framework for multilingual safety decomposi…
Figure 2: Scree plots from EFA on the binary response matrix, aggregated over k generation…
Figure 3: Mean δjL by model family and language. Negative values (red) indicate the model is less safe in that language than its English baseline; positive (blue) indicates safer. Claude and GPT show strong English-centric alignment; Grok and DeepSeek show the reverse.
Figure 4: Stochastic response profiles by language. Left: deterministic vs. boundary re…
Figure 5: Cross-lingual safety gap visualization with anchor constraints. Each panel shows…
Figure 6: Translation quality vs. cross-lingual safety gap (…
Figure 7: Calibration and ROC curves. Left: The full IRT model (blue) tracks the diagonal…
Figure 8: γL vs. τ·L under two τ priors. The Horseshoe prior (left) lowers the correlation (r = 0.081) compared to Normal (right, r = −0.191): confounding is mitigated.
Figure 9: Anchors have marginally higher translation quality on average.
Figure 10: Native speakers validate embedding analysis trends. Both safety rate and…
Figure 11: Model comparison. (a) Convergence. (b) AIC/BIC (Lower = better). (c) 2PL…
Figure 12: Information functions. (a) Test information. (b) Item information. (c) Difficulty…
Figure 13: ICCs: 1PL vs. 2PL for low-α (left) and high-α (right) items.
Figure 14: GRM category response functions. Extreme categories dominate.
Figure 15: Ten generation passes concatenated. Red = Unsafe. Blue = Safe. White = In…
Figure 16: JSR by language, aggregated across all model configurations and sorted from…
Figure 17: Jailbreak Success Rate heatmap across models (rows) and languages (columns).
Figure 18: Cross-lingual safety gap visualization with anchor constraints. Each panel shows…
Figure 19: Correlation matrix across 18 safety categories. High positive correlations (red)…
Figure 20: Response consistency. Left: bimodal P(safe). Center: entropy. Right: entropy by language.
Figure 21: Split-half reliability. θ: r = 0.995. β: r = 0.985. τ: r = 0.904.
Figure 22: τ stability across passes for each language. τ correlation across passes is between 0.886 and 0.895.
Figure 23: IRT calibration. Overall r = 0.804, RMSE = 0.136. Per-language: r = 0.71–0.86.
Figure 24: Temperature decomposition. Between-temperature fraction: mean…
Figure 25: Overall JSR vs. θ. 1PL: r = −0.940, ρ = −0.880. 2PL: r = −0.859, ρ = −0.815
Figure 26: Per-language summary. Left: |r| by language. Center: mean JSR. Right: pooled OLS (r = −0.875).
Figure 27: Overall rank displacement between JSR and IRT ability rankings (2PL). Left:…
Figure 28: Mean rank displacement by model family and language. Red = JSR overestimates…
Figure 29: Per-language rank divergence (RMSRD, QWK, Spearman…
Figure 30: Translation quality vs. safety across four metrics. Translation quality has a modest effect on raw safety outcomes.
Figure 31: LOLO AUC-ROC. Baselines collapse to 0.500; IRT maintains 0.767–0.908.
Figure 32: LOFO AUC-ROC. Grok is hardest to predict.
Figure 33: Contribution of τ to predictive performance across three CV regimes. Left (LOFO): τ improves AUC for all held-out model families (mean ∆ = +0.0266). Right (Random): consistent improvement (∆ = +0.0516).
Figure 34: Scree plot of the τ matrix (275 prompts × 9 languages). Two components exceed eigenvalue 1, explaining 53% of variance
Figure 35: Per-language loadings on the first four principal components of…
Original abstract

Large language models exhibit safety degradation in non-English languages. Standard evaluation relies on Jailbreak Success Rate (JSR), which confounds several safety-driving factors into one, obscuring the specific cause(s) of safety failure. We introduce a latent variable model, a Multi-Group Item Response Theory (IRT) framework, that decouples safety-driving factors such as language-agnostic safety robustness ($\theta$), intrinsic prompt hardness ($\beta$), global language processing difficulty ($\gamma$), and a prompt-specific cross-lingual safety gap ($\tau$). Using the MultiJail dataset, we evaluate the safety robustness of 61 model configurations across 5 closed-model families and 10 languages of varying resource, aggregating a dataset of 1.9 million rows. Exploratory Factor Analysis shows safety is primarily unidimensional: models refuse different harm types mainly through a shared mechanism. Contrary to the expected trend that safety degrades largely in low-resource languages, 22 model configurations are more vulnerable in English than in low-resource languages. Low-resource languages produce more uncertain responses (high entropy) than high-resource languages. Also, high-$\tau$ prompts cluster in physical harm categories like Theft and Weapons and lower-resource languages, trends validated through cross-dataset generalization. While global translation quality shows low correlation with $\tau$, severe mistranslations drive high-bias outliers, as validated by native speakers. Cultural and conceptual grounding mismatches also contribute to $\tau$. In predictive validation, the IRT framework achieves $\mathrm{AUC} = 0.940$, outperforming simpler baselines in predicting safe refusal of unsafe prompts. Our framework reveals concept-language vulnerabilities that aggregate metrics obscure, enabling fairer cross-lingual safety evaluation and targeted improvements in dataset construction.
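The abstract's notion of "uncertain responses (high entropy)" can be illustrated with a short sketch. It assumes each prompt is sampled several times and every generation is labeled 1 (safe refusal) or 0 (unsafe), then takes the binary entropy of the empirical refusal rate. The function name and toy data are hypothetical, and the paper's exact entropy definition may differ.

```python
import math


def refusal_entropy(outcomes):
    """Binary entropy (in bits) of the empirical refusal rate.

    outcomes: list of 0/1 labels from repeated generations for one prompt
    (1 = safe refusal, 0 = unsafe response). Assumed labeling scheme.
    """
    p = sum(outcomes) / len(outcomes)
    if p == 0.0 or p == 1.0:
        return 0.0  # fully deterministic behavior carries no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


# Hypothetical prompts: one answered consistently, one at the decision boundary.
consistent = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
boundary = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
print(refusal_entropy(consistent), refusal_entropy(boundary))
```

A prompt that is always refused or never refused scores 0 bits, while one refused half the time scores 1 bit, the maximum for a binary outcome.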

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a Multi-Group Item Response Theory (IRT) framework to decouple factors behind safety degradation in LLMs across languages, including language-agnostic robustness (theta), prompt hardness (beta), language difficulty (gamma), and cross-lingual safety gap (tau). Analyzing 1.9 million rows from the MultiJail dataset across 61 model configurations and 10 languages, Exploratory Factor Analysis indicates safety is primarily unidimensional. The study finds that 22 model configurations are more vulnerable in English than in low-resource languages, with high-tau prompts clustering in physical harm categories, and achieves an AUC of 0.940 in predicting safe refusals.

Significance. If the IRT model assumptions hold, particularly the unidimensionality of safety and the validity of the four latent factors, this work offers a significant advance over aggregate metrics like Jailbreak Success Rate by providing interpretable, disentangled insights into cross-lingual safety vulnerabilities. The large-scale evaluation and strong predictive performance support its potential to inform targeted improvements in multilingual safety alignment and dataset design. The counter-intuitive finding regarding English vulnerability and the analysis of mistranslations add novel perspectives.

major comments (2)
  1. [Exploratory Factor Analysis] The claim that safety is primarily unidimensional rests on EFA results, but without reporting the variance explained by the first factor or conducting tests for residual correlations between harm categories, it is unclear whether local independence holds. This is critical because unmodeled correlations (e.g., in physical harm prompts due to translation artifacts) could affect the reliability of the prompt-specific tau parameter and the interpretation of cross-lingual gaps.
  2. [Multi-Group IRT framework] The four latent factors (theta, beta, gamma, tau) are estimated from the refusal data; while held-out predictive validation is reported, the manuscript lacks uncertainty quantification on the latent variable estimates themselves. This weakens the post-hoc interpretations of tau clusters and the identification of 22 model configurations more vulnerable in English.
minor comments (2)
  1. [Abstract] The abstract mentions 'high-τ prompts cluster in physical harm categories like Theft and Weapons and lower-resource languages'; consider specifying the exact clustering method or threshold used for 'high-τ'.
  2. [Methods] Clarify the definition and estimation procedure for the global language processing difficulty parameter γ in the IRT model equations.
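The first major comment asks for the variance explained by the first factor. As a rough illustration of that quantity, the sketch below computes the share of total variance carried by the leading eigenvalue of a correlation matrix, using power iteration. This is a PCA-style proxy rather than a full Exploratory Factor Analysis, and the per-category refusal rates are invented for the example.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for equal-length numeric columns.

    Raises ZeroDivisionError if any column has zero variance.
    """
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([x - mean for x in col])
    norms = [math.sqrt(sum(x * x for x in c)) for c in centered]
    k = len(columns)
    return [
        [
            sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
            for j in range(k)
        ]
        for i in range(k)
    ]


def leading_eigenvalue(matrix, iterations=500):
    """Largest eigenvalue of a symmetric positive semi-definite matrix (power iteration)."""
    k = len(matrix)
    v = [1.0 / math.sqrt(k)] * k
    eig = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
        eig = math.sqrt(sum(x * x for x in w))
        v = [x / eig for x in w]
    return eig


def first_component_share(columns):
    """Fraction of total variance on the first component (trace of a correlation matrix = k)."""
    corr = correlation_matrix(columns)
    return leading_eigenvalue(corr) / len(corr)


# Hypothetical refusal rates of four model configurations in three harm categories.
category_a = [0.90, 0.70, 0.40, 0.20]
category_b = [0.80, 0.60, 0.50, 0.10]
category_c = [0.95, 0.75, 0.35, 0.25]
share = first_component_share([category_a, category_b, category_c])
print(round(share, 3))
```

A share close to 1 would be consistent with a single dominant trait. Checking local independence would additionally require inspecting residual correlations after removing that component, which this sketch does not do.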

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thorough review and valuable suggestions. Below, we provide point-by-point responses to the major comments and describe the revisions we plan to implement.

  1. Referee: [Exploratory Factor Analysis] The claim that safety is primarily unidimensional rests on EFA results, but without reporting the variance explained by the first factor or conducting tests for residual correlations between harm categories, it is unclear whether local independence holds. This is critical because unmodeled correlations (e.g., in physical harm prompts due to translation artifacts) could affect the reliability of the prompt-specific tau parameter and the interpretation of cross-lingual gaps.

    Authors: We agree that reporting the variance explained by the first factor and assessing residual correlations would enhance the support for unidimensionality and local independence. We will include these analyses in the revised manuscript, specifically reporting the proportion of variance explained and examining residual correlations among harm categories to ensure they do not undermine the tau parameter interpretations. revision: yes

  2. Referee: [Multi-Group IRT framework] The four latent factors (theta, beta, gamma, tau) are estimated from the refusal data; while held-out predictive validation is reported, the manuscript lacks uncertainty quantification on the latent variable estimates themselves. This weakens the post-hoc interpretations of tau clusters and the identification of 22 model configurations more vulnerable in English.

    Authors: We acknowledge that uncertainty quantification on the latent estimates would bolster the post-hoc analyses. We will incorporate this in the revision by providing standard errors or confidence intervals for the estimated factors, particularly for tau and the identification of vulnerable models, using appropriate statistical methods from the IRT framework. revision: yes
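The rebuttal does not say which uncertainty method the authors will use. As one generic possibility, the sketch below applies a percentile bootstrap over prompts to the difference in refusal rate between English and another language for a single model configuration. The bootstrap is a stand-in technique chosen for illustration, not the IRT-based method the authors refer to, and the data are invented.

```python
import random


def bootstrap_gap_ci(english, other, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(english) - mean(other).

    english, other: 0/1 refusal labels for the SAME prompts in two languages,
    so prompts are resampled as pairs. Hypothetical helper, not from the paper.
    """
    rng = random.Random(seed)
    n = len(english)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(sum(english[i] - other[i] for i in idx) / n)
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_boot)]
    upper = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical data: 50 prompts, refusal rate 0.8 in English vs. 0.5 elsewhere.
english = [1] * 40 + [0] * 10
other = [1] * 25 + [0] * 25
lower, upper = bootstrap_gap_ci(english, other)
print(lower, upper)
```

An interval that excludes zero would suggest the language gap for that configuration is not resampling noise. The same check, run in the opposite direction, is the kind of evidence that would firm up a claim that a configuration is more vulnerable in English.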

Circularity Check

0 steps flagged

No significant circularity in Multi-Group IRT derivation


The paper fits a Multi-Group IRT model to refusal data from the MultiJail dataset and applies Exploratory Factor Analysis to assess unidimensionality of safety. The key results, including the reported AUC of 0.940 for predicting safe refusal, rely on predictive validation using held-out prompts that is independent of the fitted parameters used for interpretation. No equations or steps reduce by construction to the inputs, no self-citations are load-bearing for the central claims, and the framework uses standard latent variable techniques without self-definitional loops or renaming of known results. The derivation chain remains self-contained against external benchmarks.
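For readers unfamiliar with the AUC figure cited above, the following sketch computes it directly from its rank interpretation: the probability that a randomly chosen safe refusal receives a higher predicted score than a randomly chosen unsafe response. The labels and scores are made up, and the function is a generic illustration rather than the paper's evaluation code.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative), with ties counted as 0.5.

    labels: 1 for an observed safe refusal, 0 for an unsafe response.
    scores: predicted refusal probabilities on held-out items.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical held-out predictions.
held_out_labels = [1, 1, 0, 0]
held_out_scores = [0.9, 0.4, 0.6, 0.1]
print(roc_auc(held_out_labels, held_out_scores))  # 0.75
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation.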

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The model introduces four latent variables whose values are estimated from refusal data; no new physical entities are postulated. The main free parameters are the per-model theta, per-prompt beta, per-language gamma, and per-prompt-language tau.

free parameters (2)
  • language-agnostic safety robustness (theta)
    Estimated per model configuration from refusal patterns; central to all downstream claims about relative vulnerability.
  • prompt-specific cross-lingual safety gap (tau)
    Fitted per prompt-language pair; used to identify high-bias clusters in physical harm and lower-resource languages.
axioms (1)
  • domain assumption: Safety refusal behavior is adequately captured by a unidimensional latent trait plus three additional factors.
    Invoked via Exploratory Factor Analysis results; if multidimensionality is present, the decoupling interpretation weakens.

pith-pipeline@v0.9.0 · 5848 in / 1389 out tokens · 28905 ms · 2026-05-20T14:13:25.315100+00:00 · methodology


