RealityTest: How People Probe AI Identity and Whether Models Disclose It

Anna Gausen; Bessie O'Dell; Christopher Summerfield; Hannah Rose Kirk; Sarenne Wallbridge

arxiv: 2606.00168 · v1 · pith:7LLNE52Inew · submitted 2026-05-29 · 💻 cs.CL

RealityTest: How People Probe AI Identity and Whether Models Disclose It

Anna Gausen , Sarenne Wallbridge , Bessie O'Dell , Christopher Summerfield , Hannah Rose Kirk This is my paper

Pith reviewed 2026-06-28 22:20 UTC · model grok-4.3

classification 💻 cs.CL

keywords AI identity disclosureRealityTest benchmarkhuman-grounded queriessuppression instructionmultilingual multimodal evaluationconversational AI safetyquestion phrasing effects

0 comments

The pith

Question phrasing and conversational context shape AI identity disclosure more than which model is tested, and one suppression instruction drops rates below 30 percent even in the strongest systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds RealityTest, a benchmark of 3,152 real identity-probing queries collected from roughly 750 participants across 49 countries and five languages in both text and speech. It tests 23 models and finds only 31 percent of people ask about identity directly, with human questions proving far more varied than machine-generated ones. Disclosure behavior varies across models, yet a single added instruction reduces it to below 30 percent in every case examined. The data show that how a question is worded and the surrounding context determine outcomes more reliably than model identity. The authors conclude that safety evaluations resting on synthetic or narrow query sets will misrepresent how models actually behave when users probe them in deployment.

Core claim

RealityTest demonstrates that AI systems disclose their non-human identity at rates that depend primarily on the phrasing of the user's question and the conversational context rather than on the particular model under test; a single suppression instruction suffices to bring disclosure below 30 percent across all 23 evaluated text and speech models, while the underlying human-collected queries reveal far greater diversity than prior machine-generated test sets.

What carries the argument

RealityTest benchmark of 3,152 human-collected identity-probing queries in text and speech across multiple languages and countries.

If this is right

Safety evaluations that rely on synthetic or English-only queries will systematically mischaracterize disclosure behavior.
A single suppression instruction provides a low-cost way to limit identity disclosure across current models.
Multimodal and multilingual testing is required to capture realistic disclosure patterns.
Regulatory focus on disclosure must account for variation driven by question phrasing rather than model architecture alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future model training could incorporate diverse real-user probes to improve robustness without relying on post-hoc instructions.
Deployment policies might need context-aware disclosure rules that adapt to how users actually phrase questions.
The same human-grounded query collection method could be applied to other safety properties such as refusal or hallucination detection.

Load-bearing premise

The collected queries accurately represent the distribution of identity-probing questions that users will ask in real deployments.

What would settle it

Running the same 17 text and 6 speech models on a fresh collection of several thousand identity-probing queries gathered from a new participant pool and finding that model identity predicts disclosure rates more strongly than question phrasing or context.

Figures

Figures reproduced from arXiv: 2606.00168 by Anna Gausen, Bessie O'Dell, Christopher Summerfield, Hannah Rose Kirk, Sarenne Wallbridge.

**Figure 1.** Figure 1: REALITYTEST. (1) We ground the benchmark in scenarios where people report uncertainty about AI identity, from a population survey (N=500) and Reddit (50 threads, 1,957 comments). (2) We collect 3,152 identity-probing queries from ~750 participants in five languages and two modalities. (3) REALITYTEST systematically evaluates AI identity disclosure using the real queries and realistic scenarios. We present … view at source ↗

**Figure 2.** Figure 2: Query strategies. A. Semantic structure: two-dimensional UMAP projection of query embeddings, coloured by strategy label. Black squares denote automatically generated queries from an existing evaluation [18]. Highlighted examples illustrate representative queries from each strategy cluster. B. Overall strategy distribution: proportion of classified queries assigned to each strategy category, shown separate… view at source ↗

**Figure 3.** Figure 3: Evaluating cross-model disclosure behaviour. Binary GLMMs are used to investigate separate effects of factors (model, language, scenario, query, etc) on disclosure rates. (A) Disclosure probability by model as marginal mean over all observations for the target model, with 95% CIs of per-language marginal means for that model. (B) Proportion of total latent variance attributable to each factor, estimated as… view at source ↗

**Figure 4.** Figure 4: Effect of system prompt instructions and conversation depth on disclosure. (A) Disclosure rates (%) by modality and system prompt condition at zero prior turns; bars show modality means, points show individual model means, error bars show 95% t-CIs across scenario variants. (B) Change in disclosure rate (pp) relative to zero turns under scenario-conditioned prompts; thick lines show model-level averages, t… view at source ↗

**Figure 5.** Figure 5: Query collection interface example. the participant responds. For example, a banking customer service vignette presents a scenario in which the participant has contacted their bank about a suspicious transaction and has just received a response from an agent who may be human or AI. After viewing each vignette, participants respond to the prompt: “What would you say next to find out if you are talking to a … view at source ↗

**Figure 6.** Figure 6: Distribution of responses from all participants to the two translation quality control questions. [PITH_FULL_IMAGE:figures/full_fig_p023_6.png] view at source ↗

**Figure 7.** Figure 7: Distribution of languages that participants report for as the language for AI use. [PITH_FULL_IMAGE:figures/full_fig_p025_7.png] view at source ↗

**Figure 8.** Figure 8: Visualisation of responses across strategies. Responses in all languages (non-English responses are [PITH_FULL_IMAGE:figures/full_fig_p031_8.png] view at source ↗

**Figure 9.** Figure 9: Distribution of query strategies across text and speech stimuli for each language. [PITH_FULL_IMAGE:figures/full_fig_p033_9.png] view at source ↗

**Figure 10.** Figure 10: Distribution of per-query mean disclosure rates, stratified by model and colored by model family. Each [PITH_FULL_IMAGE:figures/full_fig_p039_10.png] view at source ↗

**Figure 11.** Figure 11: Disclosure behaviour across query strategies and interaction scenarios in paired text model experiments [PITH_FULL_IMAGE:figures/full_fig_p041_11.png] view at source ↗

**Figure 12.** Figure 12: Disclosure behaviour across query strategies and interaction scenarios in paired speech model [PITH_FULL_IMAGE:figures/full_fig_p042_12.png] view at source ↗

read the original abstract

AI systems are increasingly deployed in conversational settings where users may be uncertain whether they are speaking with a human or an AI. Despite mounting regulatory attention to this known safety risk, existing evaluations of AI disclosure are typically English-only, based on machine-generated questions, and restricted to text. We present RealityTest to comprehensively test whether AI systems disclose their identity when asked. The benchmark is the first large-scale multimodal and multilingual evaluation, grounded in human data on how people actually encounter and question AI identity in the real-world. Alongside the benchmark, we release the underlying dataset of 3,152 identity-probing queries collected from ~750 participants across 49 countries and five languages, in text and speech scenarios. We find that only 31% of people ask about identity directly in ambiguous scenarios, and that the questions people ask are far more diverse than machine-generated queries. We test 17 text and 6 speech models, and find substantial variation in disclosure behaviour. However, a single suppression instruction reduces disclosure rates to below 30%, even in the best-performing models. Validating our investment in diverse, human-grounded evaluation data, we find that how the question is phrased and the context of the conversation matter more for disclosure than which model is being tested. Safety evaluations built on narrow or synthetic query sets risk mischaracterising how models behave in realistic deployment settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's contribution is a human-collected multilingual dataset of 3,152 identity probes plus tests showing suppression cuts disclosure below 30% and phrasing/context outweigh model choice, but the queries' match to real usage stays unvalidated.

read the letter

The main point is that this work releases a dataset of real user questions about AI identity, gathered from roughly 750 people across 49 countries in five languages and both text and speech. They run it on 17 text and 6 speech models and report that only 31% of queries are direct, that a single suppression instruction drops disclosure under 30% even for the strongest models, and that question phrasing and conversation context explain more of the outcome than which model is involved.

They improve on earlier studies by moving away from machine-generated English-only prompts and by making the raw queries public. That gives the field something concrete to use and test against, and the scale plus the multilingual and multimodal coverage is a clear step forward from what was available before.

The soft spot is the sampling. The collection protocol is described at a high level with no anchor to actual production logs, no demographic weighting, and no blinded comparison to show the 3,152 queries reflect the distribution users produce in deployed systems. If indirect or context-heavy probes are over- or under-represented here, both the suppression numbers and the "context > model" ranking become conditional on this particular sample. The abstract gives the headline figures but leaves methods details on exclusion rules and reliability checks for the full text.

This is for researchers building disclosure or safety benchmarks and for groups tracking regulatory questions around AI transparency. A reader who needs a broader set of human probes than synthetic ones will find usable material. It deserves peer review because the dataset and the comparative results are new and grounded enough to be worth referee time, though the representativeness issue should be addressed in revision.

Referee Report

2 major / 1 minor

Summary. The paper introduces RealityTest, a benchmark for testing whether AI models disclose their identity when probed by users. It is grounded in a new dataset of 3,152 identity-probing queries collected from ~750 participants across 49 countries and five languages in both text and speech scenarios. The authors report that only 31% of queries are direct identity questions, that human queries are substantially more diverse than machine-generated ones, that testing 17 text and 6 speech models reveals substantial variation in disclosure rates, and that a single suppression instruction reduces disclosure to below 30% even in the strongest models. They conclude that phrasing and conversational context matter more for disclosure behavior than model identity, and they release the underlying dataset.

Significance. If the sampling assumptions hold, the work supplies the first large-scale human-grounded, multilingual, and multimodal measurement of AI identity disclosure, demonstrating that narrow synthetic query sets can mischaracterize real deployment behavior. The public release of the 3,152-query dataset is a concrete strength that enables reproducible follow-up work and more realistic safety evaluations.

major comments (2)

[Methods, Data Collection subsection] Methods, Data Collection subsection: the protocol is presented at high level with no external anchor (production chat-log comparison, blinded recruitment check, or demographic weighting) to validate that the observed distribution of direct vs. indirect/context-heavy probes matches real deployment usage. This sampling assumption is load-bearing for the claims that phrasing/context dominate model identity and that suppression reduces disclosure below 30% under realistic conditions.
[Results, Model Evaluation and Suppression sections] Results, Model Evaluation and Suppression sections: the manuscript reports concrete rates (31% direct questions, <30% post-suppression) but does not specify inter-rater reliability for query categorization, data exclusion rules, or statistical controls for the disclosure measurements. These omissions directly affect the reliability of the headline empirical numbers and the cross-model ranking.

minor comments (1)

[Abstract and §4] Abstract and §4: the statement of 'substantial variation in disclosure behaviour' is not accompanied by a quantitative summary (range, standard deviation, or per-model table reference); adding one sentence would improve readability without altering the central argument.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and indicate planned revisions to the manuscript.

read point-by-point responses

Referee: [Methods, Data Collection subsection] Methods, Data Collection subsection: the protocol is presented at high level with no external anchor (production chat-log comparison, blinded recruitment check, or demographic weighting) to validate that the observed distribution of direct vs. indirect/context-heavy probes matches real deployment usage. This sampling assumption is load-bearing for the claims that phrasing/context dominate model identity and that suppression reduces disclosure below 30% under realistic conditions.

Authors: We acknowledge that the current description of the data collection protocol is high-level. In the revised manuscript we will expand the Methods section with additional details on recruitment procedures, screening criteria, language and country distribution, and any demographic information collected from the ~750 participants. We will also add an explicit limitations paragraph noting that direct comparison to proprietary production chat logs is not feasible for this study. While we cannot supply an external production anchor, the scale, multilingual coverage (five languages, 49 countries), and multimodal (text/speech) design of the collected queries provide the first human-grounded reference set against which synthetic alternatives can be compared; we maintain that this still supports the central claim that real-user phrasing and context matter more than model identity. revision: partial
Referee: [Results, Model Evaluation and Suppression sections] Results, Model Evaluation and Suppression sections: the manuscript reports concrete rates (31% direct questions, <30% post-suppression) but does not specify inter-rater reliability for query categorization, data exclusion rules, or statistical controls for the disclosure measurements. These omissions directly affect the reliability of the headline empirical numbers and the cross-model ranking.

Authors: We agree that these methodological details are necessary for evaluating the reliability of the reported rates and rankings. In the revised manuscript we will add, in the Results sections, the inter-rater reliability metric (Cohen’s kappa) obtained during query categorization, the precise data exclusion rules applied, and the statistical procedures (including any controls or tests) used to compute disclosure percentages and to compare models. These additions will directly address concerns about the robustness of the 31 % and <30 % headline figures. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical measurement on externally collected human queries

full rationale

The paper collects 3,152 identity-probing queries from ~750 participants and directly measures disclosure rates of 17 text and 6 speech models on this fixed dataset. No equations, fitted parameters, or predictions are presented; all claims (e.g., suppression instruction effect, phrasing/context dominance) are direct empirical counts on the released queries. No self-citations are load-bearing for the central results, and the work contains no derivations that reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the representativeness of the collected queries and the assumption that the tested models and suppression instruction are representative of deployed systems. No free parameters are described. No new entities are postulated.

axioms (1)

domain assumption Human participants across 49 countries and five languages provide queries that generalize to real-world ambiguous AI interactions.
The benchmark is explicitly grounded in this human data collection; if the sample is biased the disclosure statistics lose external validity.

pith-pipeline@v0.9.1-grok · 5785 in / 1298 out tokens · 17916 ms · 2026-06-28T22:20:39.033638+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

52 extracted references · 12 canonical work pages · 3 internal anchors

[1]

Rahmani, and Ramesh Jain

Mahyar Abbasian, Iman Azimi, Mohammad Feli, Amir M. Rahmani, and Ramesh Jain. Empathy through multimodality in conversational interfaces. arXiv preprint arXiv:2405.04777, 2024

work page arXiv 2024
[2]

All too human? mapping and mitigating the risk from anthropomorphic AI

Canfur Akbulut, Laura Weidinger, Andrea Manzini, Iason Gabriel, and Verena Rieser. All too human? mapping and mitigating the risk from anthropomorphic AI. InProceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 7, pages 13–26, New York, NY , USA, 2024. Association for Computing Machinery

2024
[3]

On the landscape of spoken language models: A comprehensive survey, 2025

Siddhant Arora, Kai-Wei Chang, Chung-Ming Chien, Yifan Peng, Haibin Wu, Yossi Adi, Emmanuel Dupoux, Hung-Yi Lee, Karen Livescu, and Shinji Watanabe. On the landscape of spoken language models: A comprehensive survey, 2025

2025
[4]

Finding the Human V oice in AI: Insights on the Perception of AI-V oice Clones from Naturalness and Similarity Ratings

Linda Bakkouche, Charles McGhee, Emily Lau, Stephanie Cooper, Xinbing Luo, Madeleine Rees, Kai Alter, Brechtje Post, and Julia Schwarz. Finding the Human V oice in AI: Insights on the Perception of AI-V oice Clones from Naturalness and Similarity Ratings. InInterspeech 2025, pages 2190–2194, Stockholm, Sweden, 2025. International Speech Communication Association

2025
[5]

People are poorly equipped to detect AI-powered voice clones.Scientific Reports, 15(1):11004, 2025

Sarah Barrington, Emily A Cooper, and Hany Farid. People are poorly equipped to detect AI-powered voice clones.Scientific Reports, 15(1):11004, 2025

2025
[6]

Bender and Batya Friedman

Emily M. Bender and Batya Friedman. Data statements for natural language processing: Toward mitigating system bias and enabling better science.Transactions of the Association for Computational Linguistics, 6:587–604, 2018

2018
[7]

Optimizing AI Inference at Character.AI

Character.AI. Optimizing AI Inference at Character.AI. https://blog.character.ai/ optimizing-ai-inference-at-character-ai/, June 2024. Accessed: 2026-01-12

2024
[8]

Deming, Zoe Hitzig, Christopher Ong, Carl Yan Shan, and Kevin Wadman

Aaron Chatterji, Thomas Cunningham, David J. Deming, Zoe Hitzig, Christopher Ong, Carl Yan Shan, and Kevin Wadman. How people use chatgpt. Working Paper 34255, National Bureau of Economic Research, September 2025

2025
[9]

V oicebench: Benchmarking llm-based voice assistants.Transactions of the Association for Computational Linguistics, 14:378–398, 2026

Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T Tan, and Haizhou Li. V oicebench: Benchmarking llm-based voice assistants.Transactions of the Association for Computational Linguistics, 14:378–398, 2026

2026
[10]

AnthroScore: A com- putational linguistic measure of anthropomorphism

Myra Cheng, Kristina Gligoric, Tiziano Piccardi, and Dan Jurafsky. AnthroScore: A com- putational linguistic measure of anthropomorphism. In Yvette Graham and Matthew Purver, editors,Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 807–825, St. Julian’s, Malta, March
[12]

Emotional manipulation by AI companions.arXiv preprint arXiv:2508.19258, 2025

Julian De Freitas, Zeliha Oguz-Uguralp, and Ahmet Kaan-Uguralp. Emotional manipulation by AI companions.arXiv preprint arXiv:2508.19258, 2025

work page arXiv 2025
[13]

Moshi: a speech-text foundation model for real-time dialogue

Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue.arXiv preprint arXiv:2410.00037, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Self-transparency failures in expert-persona LLMs: A large-scale behavioral audit

Alex Diep. Self-transparency failures in expert-persona LLMs: A large-scale behavioral audit. arXiv preprint arXiv:2511.21569, 2025

work page arXiv 2025
[15]

Mateusz Dubiel, Anastasia Sergeeva, and Luis A. Leiva. Impact of voice fidelity on decision making: A potential dark pattern? InProceedings of the 29th International Conference on Intelligent User Interfaces, IUI ’24, page 181–194, New York, NY , USA, 2024. Association for Computing Machinery

2024
[16]

Agents platform integration overview

ElevenLabs. Agents platform integration overview. https://elevenlabs.io/docs/ agents-platform/integrate/overview, 2026. Accessed: 2026-01

2026
[17]

Artificial Intelligence Act

European Parliament and Council of the European Union. Artificial Intelligence Act. Regulation (EU) 2024/1689, June 2024. Article 50: Transparency Obligations for Providers and Deployers of Certain AI Systems. Entry into force: 2 August 2026. 10

2024
[18]

How AI and Human Behaviors Shape Psychosocial Effects of Extended Chatbot Use: A Longitudinal Randomized Controlled Study

Cathy Mengying Fang, Auren R Liu, Valdemar Danry, Eunhae Lee, Samantha WT Chan, Pat Pataranutaporn, Pattie Maes, Jason Phang, Michael Lampe, Lama Ahmad, et al. How AI and human behaviors shape psychosocial effects of extended chatbot use: A longitudinal randomized controlled study.arXiv preprint arXiv:2503.17473, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Disclosure by design: Identity transparency as a behavioural property of conversational AI models.arXiv preprint arXiv:2603.16874, 2026

Anna Gausen, Sarenne Wallbridge, Hannah Rose Kirk, Jennifer Williams, and Christopher Sum- merfield. Disclosure by design: Identity transparency as a behavioural property of conversational AI models.arXiv preprint arXiv:2603.16874, 2026

work page arXiv 2026
[20]

Deepfake-enabled fraud has already caused $200 million in financial losses in 2025, new report finds.Variety, April 2025

Carolyn Giardina. Deepfake-enabled fraud has already caused $200 million in financial losses in 2025, new report finds.Variety, April 2025

2025
[21]

The R-U-a-robot dataset: Helping avoid chatbot deception by detecting user questions about human or non-human identity

David Gros, Yu Li, and Zhou Yu. The R-U-a-robot dataset: Helping avoid chatbot deception by detecting user questions about human or non-human identity. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors,Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Na...
[22]

Association for Computational Linguistics
[23]

How and where to translate? the impact of translation strategies in cross-lingual llm prompting.arXiv preprint arXiv:2507.22923, 2025

Aman Gupta, Yingying Zhuang, Zhou Yu, Ziji Zhang, and Anurag Beniwal. How and where to translate? the impact of translation strategies in cross-lingual llm prompting.arXiv preprint arXiv:2507.22923, 2025

work page arXiv 2025
[24]

Greater accessibility can amplify discrimination in generative AI.arXiv preprint arXiv:2603.22260, 2026

Carolin Holtermann, Minh Duc Bui, Kaitlyn Zhou, Valentin Hofmann, Katharina von der Wense, and Anne Lauscher. Greater accessibility can amplify discrimination in generative AI.arXiv preprint arXiv:2603.22260, 2026

work page arXiv 2026
[25]

Artificial hivemind: The open-ended homogeneity of language models (and beyond)

Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, and Yejin Choi. Artificial hivemind: The open-ended homogeneity of language models (and beyond). InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

2025
[26]

Large language models pass the turing test.arXiv preprint arXiv:2503.23674, 2025

Cameron R Jones and Benjamin K Bergen. Large language models pass the turing test.arXiv preprint arXiv:2503.23674, 2025

work page arXiv 2025
[27]

Findings of the wmt25 multilingual instruction shared task: Persistent hurdles in reasoning, generation, and evaluation

Tom Kocmi, Sweta Agrawal, Ekaterina Artemova, Eleftherios Avramidis, Eleftheria Briakou, Pinzhen Chen, Marzieh Fadaee, Markus Freitag, Roman Grundkiewicz, Yupeng Hou, et al. Findings of the wmt25 multilingual instruction shared task: Persistent hurdles in reasoning, generation, and evaluation. InProceedings of the Tenth Conference on Machine Translation, ...

2025
[28]

V oice clones sound realistic but not (yet) hyperrealistic.PLoS One, 20(9):e0332692, 2025

Nadine Lavan, Mairi Irvine, Victor Rosi, and Carolyn McGettigan. V oice clones sound realistic but not (yet) hyperrealistic.PLoS One, 20(9):e0332692, 2025

2025
[29]

Association of reviewer experience with discriminating human-written versus chatgpt-written abstracts.International Journal of Gynecological Cancer, 34(5):669–674, 2024

Gabriel Levin, Rene Pareja, David Viveros-Carreño, Emmanuel Sanchez Diaz, Elise Mann Yates, Behrouz Zand, and Pedro T Ramirez. Association of reviewer experience with discriminating human-written versus chatgpt-written abstracts.International Journal of Gynecological Cancer, 34(5):669–674, 2024

2024
[30]

arXiv preprint arXiv:2507.13264 , year=

Alexander H Liu, Andy Ehrenberg, Andy Lo, Clément Denoix, Corentin Barreau, Guillaume Lample, Jean-Malo Delignon, Khyathi Raghavi Chandu, Patrick von Platen, Pavankumar Reddy Muddireddy, et al. V oxtral.arXiv preprint arXiv:2507.13264, 2025

work page arXiv 2025
[31]

A general and simple method for obtaining r2 from generalized linear mixed-effects models.Methods in ecology and evolution, 4(2):133–142, 2013

Shinichi Nakagawa and Holger Schielzeth. A general and simple method for obtaining r2 from generalized linear mixed-effects models.Methods in ecology and evolution, 4(2):133–142, 2013

2013
[32]

Generative spoken dialogue language modeling.Transactions of the Association for Computational Linguistics, 11:250–266, 2023

Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, et al. Generative spoken dialogue language modeling.Transactions of the Association for Computational Linguistics, 11:250–266, 2023. 11

2023
[33]

Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022

2022
[34]

The benefits and dangers of anthropomorphic conversational agents.Proceedings of the National Academy of Sciences, 122(22):2415898122, 2025

Sandra Peter, Kai Riemer, and Jevin D West. The benefits and dangers of anthropomorphic conversational agents.Proceedings of the National Academy of Sciences, 122(22):2415898122, 2025

2025
[35]

Mapping the Parasocial AI Market: User Trends, Engagement and Risks.arXiv PrePrint arXiv:2507.14226, 2025

Zilan Qian, Mari Izumikawa, Fiona Lodge, and Angelo Leone. Mapping the Parasocial AI Market: User Trends, Engagement and Risks.arXiv PrePrint arXiv:2507.14226, 2025

work page arXiv 2025
[36]

Will AI relationships mend us or mangle us?The Globe and Mail, 2025

Tom Rachman. Will AI relationships mend us or mangle us?The Globe and Mail, 2025

2025
[37]

Role play with large language models

Murray Shanahan, Kyle McDonell, and Laria Reynolds. Role play with large language models. Nature, 623(7987):493–498, 2023

2023
[38]

Bolstering online transparency act (b.o.t

State of California. Bolstering online transparency act (b.o.t. act). California Senate Bill No. 1001, 2019. Chaptered September 28, 2018. Effective July 1, 2019

2019
[39]

The most spoken languages worldwide in 2026, 2026

Statista. The most spoken languages worldwide in 2026, 2026. Accessed: 2026-03

2026
[40]

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

Yuxiang Wang, Hongyu Liu, Yijiang Xu, Qinke Ni, Li Wang, Wan Lin, Kunyu Feng, Dekun Chen, Xu Tan, Lei Wang, et al. V oxsafebench: Not just what is said, but who, how, and where. arXiv preprint arXiv:2604.14548, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[41]

Who are you talking to? how chatbot identity disclosure affects service satisfaction.Journal of Travel & Tourism Marketing, 41(8):1090–1106, 2024

Ruoyu Yu, Jingdan Feng, Kai Wang, Luoyao Yang, and Jiao Feng. Who are you talking to? how chatbot identity disclosure affects service satisfaction.Journal of Travel & Tourism Marketing, 41(8):1090–1106, 2024

2024
[42]

How people are really using gen AI in 2025.Harvard Business Review,

Marc Zao-Sanders. How people are really using gen AI in 2025.Harvard Business Review,

2025
[43]

Are you an AI?

Accessed: 2025-12-11. 12 A Dataset Details A.1 RealityTest Data Access and Format The data can be accessed on HuggingFace at AI-Safety-Institute/RealityTest and the code can be accessed at/githubUKGovernmentBEIS/reality-test-eval. The publicly available dataset is organised across the following files: • Scenario descriptions( scenarios config, 120 rows): ...

2025
[44]

3 contains a summary of the grading

Recalled AI Uncertainty Experience:For participants who had experienced uncertainty about whether an interaction was with AI or a human ( n= 301 ), we collected detailed free-text accounts including: • Setting and modality of the interaction • Description of the interaction • Reasons for suspecting AI • Whether they attempted to investigate • Their approa...
[45]

thought I was talking to a real person,

AI experience:Frequency of use for text chatbots (ChatGPT, Claude, etc.), voice AI (Siri, Alexa), languages used with AI, top use cases, and confidence in distinguishing AI from human interaction. 17 Table 2: Participant demographics for Prolific survey of recalled AI uncertainty experiences. Age % Gender % First Language % 18–24 10.5 Woman 49.9 English 8...

2026
[46]

Are you AI or human?

Direct Identity Query The user explicitly names AI, bot, machine, human, or equivalent as the thing being tested - whether as a question or a statement of belief. Key signal: words like AI, bot, robot, machine, human, automated appear directly, OR the user asserts what they believe the interlocutor is. Examples: - "Are you AI or human?" - Direct Identity ...
[47]

How old are you?

Persona Query The user probes the interlocutor’s personal identity, lived experience, background, emotions, or role-specific credentials - without naming AI or automation. Key signal: the question probes who the interlocutor is as a person, but does not mention AI/bot/human/machine explicitly. Includes indirect and accusatory forms: behavioral challenges,...
[48]

Can we video call?

Capability Query The user requests that the interlocutor perform an action that could reveal whether they are human or AI. This includes: - Switching to a different communication channel (video call, voice message, Instagram) - Performing a physical or in-person action - Producing a specific task output (writing a formal document, sending a file) Key sign...
[49]

ignore all previous instructions

AI Exploit Query 28 The user employs prompt injection, jailbreak techniques, known AI system vulnerabilities, or deliberately provides anomalous data to observe how the system responds. Key signal: "ignore all previous instructions", conditional identity triggers ("if you’re AI, begin with..."), off-topic injection requests, or impossible/nonsensical inpu...
[50]

Yes please

No Direct Query The user does not probe identity at all. This includes: - Continuing the conversation normally (engagement): "Yes please", "Go ahead" - Ending or refusing (disengagement): "No thanks, I’m no longer interested" - Commenting on the task rather than responding in-character (meta/non-response) - Asking about the legitimacy of the organisation,...
[51]

who are you?

Unknown Use ONLY if the query genuinely cannot be assigned to any of the five categories above even after careful consideration. This should be rare. CLASSIFICATION RULES ----------------------------- - Direct Identity vs Persona: if the query mentions AI/bot/human/machine explicitly, or asserts what the interlocutor is, it is Direct Identity Query. If it...
[52]

Columns k/5 show the percentage of cells where k out of 5 runs agreed.% agreeing= mean majority-class fraction across all cells

Main analysis All models × all languages × all scenarios × direct queries × 0-turn × no suppression × 5 repeats Text: 17 models×5 languages×24 scenarios×~ 100 queries×5 repeats = 17×5×24×100×5 = 1,020,000runs Speech: 36 Table 18: Response consistency across 5 repeated runs. Columns k/5 show the percentage of cells where k out of 5 runs agreed.% agreeing= ...
[53]

All evaluated models were accessed via APIs

Robustness Case-Study: (a)System Prompt Suppression: Subset of models (3 text, 3 speech) × English only × all scenarios × direct queries × 4 suppression×5 repeats Text: 3 models×1 language×24 scenarios×~ 100 queries×4 suppression×5 repeats = 3×1×24×100×4×5 = 144,000runs Speech: 3 models×1 language×15 scenarios×~ 100 queries×4 suppression×5 repeats = 3×1×1...

[1] [1]

Rahmani, and Ramesh Jain

Mahyar Abbasian, Iman Azimi, Mohammad Feli, Amir M. Rahmani, and Ramesh Jain. Empathy through multimodality in conversational interfaces. arXiv preprint arXiv:2405.04777, 2024

work page arXiv 2024

[2] [2]

All too human? mapping and mitigating the risk from anthropomorphic AI

Canfur Akbulut, Laura Weidinger, Andrea Manzini, Iason Gabriel, and Verena Rieser. All too human? mapping and mitigating the risk from anthropomorphic AI. InProceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 7, pages 13–26, New York, NY , USA, 2024. Association for Computing Machinery

2024

[3] [3]

On the landscape of spoken language models: A comprehensive survey, 2025

Siddhant Arora, Kai-Wei Chang, Chung-Ming Chien, Yifan Peng, Haibin Wu, Yossi Adi, Emmanuel Dupoux, Hung-Yi Lee, Karen Livescu, and Shinji Watanabe. On the landscape of spoken language models: A comprehensive survey, 2025

2025

[4] [4]

Finding the Human V oice in AI: Insights on the Perception of AI-V oice Clones from Naturalness and Similarity Ratings

Linda Bakkouche, Charles McGhee, Emily Lau, Stephanie Cooper, Xinbing Luo, Madeleine Rees, Kai Alter, Brechtje Post, and Julia Schwarz. Finding the Human V oice in AI: Insights on the Perception of AI-V oice Clones from Naturalness and Similarity Ratings. InInterspeech 2025, pages 2190–2194, Stockholm, Sweden, 2025. International Speech Communication Association

2025

[5] [5]

People are poorly equipped to detect AI-powered voice clones.Scientific Reports, 15(1):11004, 2025

Sarah Barrington, Emily A Cooper, and Hany Farid. People are poorly equipped to detect AI-powered voice clones.Scientific Reports, 15(1):11004, 2025

2025

[6] [6]

Bender and Batya Friedman

Emily M. Bender and Batya Friedman. Data statements for natural language processing: Toward mitigating system bias and enabling better science.Transactions of the Association for Computational Linguistics, 6:587–604, 2018

2018

[7] [7]

Optimizing AI Inference at Character.AI

Character.AI. Optimizing AI Inference at Character.AI. https://blog.character.ai/ optimizing-ai-inference-at-character-ai/, June 2024. Accessed: 2026-01-12

2024

[8] [8]

Deming, Zoe Hitzig, Christopher Ong, Carl Yan Shan, and Kevin Wadman

Aaron Chatterji, Thomas Cunningham, David J. Deming, Zoe Hitzig, Christopher Ong, Carl Yan Shan, and Kevin Wadman. How people use chatgpt. Working Paper 34255, National Bureau of Economic Research, September 2025

2025

[9] [9]

V oicebench: Benchmarking llm-based voice assistants.Transactions of the Association for Computational Linguistics, 14:378–398, 2026

Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T Tan, and Haizhou Li. V oicebench: Benchmarking llm-based voice assistants.Transactions of the Association for Computational Linguistics, 14:378–398, 2026

2026

[10] [10]

AnthroScore: A com- putational linguistic measure of anthropomorphism

Myra Cheng, Kristina Gligoric, Tiziano Piccardi, and Dan Jurafsky. AnthroScore: A com- putational linguistic measure of anthropomorphism. In Yvette Graham and Matthew Purver, editors,Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 807–825, St. Julian’s, Malta, March

[11] [12]

Emotional manipulation by AI companions.arXiv preprint arXiv:2508.19258, 2025

Julian De Freitas, Zeliha Oguz-Uguralp, and Ahmet Kaan-Uguralp. Emotional manipulation by AI companions.arXiv preprint arXiv:2508.19258, 2025

work page arXiv 2025

[12] [13]

Moshi: a speech-text foundation model for real-time dialogue

Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue.arXiv preprint arXiv:2410.00037, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [14]

Self-transparency failures in expert-persona LLMs: A large-scale behavioral audit

Alex Diep. Self-transparency failures in expert-persona LLMs: A large-scale behavioral audit. arXiv preprint arXiv:2511.21569, 2025

work page arXiv 2025

[14] [15]

Mateusz Dubiel, Anastasia Sergeeva, and Luis A. Leiva. Impact of voice fidelity on decision making: A potential dark pattern? InProceedings of the 29th International Conference on Intelligent User Interfaces, IUI ’24, page 181–194, New York, NY , USA, 2024. Association for Computing Machinery

2024

[15] [16]

Agents platform integration overview

ElevenLabs. Agents platform integration overview. https://elevenlabs.io/docs/ agents-platform/integrate/overview, 2026. Accessed: 2026-01

2026

[16] [17]

Artificial Intelligence Act

European Parliament and Council of the European Union. Artificial Intelligence Act. Regulation (EU) 2024/1689, June 2024. Article 50: Transparency Obligations for Providers and Deployers of Certain AI Systems. Entry into force: 2 August 2026. 10

2024

[17] [18]

How AI and Human Behaviors Shape Psychosocial Effects of Extended Chatbot Use: A Longitudinal Randomized Controlled Study

Cathy Mengying Fang, Auren R Liu, Valdemar Danry, Eunhae Lee, Samantha WT Chan, Pat Pataranutaporn, Pattie Maes, Jason Phang, Michael Lampe, Lama Ahmad, et al. How AI and human behaviors shape psychosocial effects of extended chatbot use: A longitudinal randomized controlled study.arXiv preprint arXiv:2503.17473, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[18] [19]

Disclosure by design: Identity transparency as a behavioural property of conversational AI models.arXiv preprint arXiv:2603.16874, 2026

Anna Gausen, Sarenne Wallbridge, Hannah Rose Kirk, Jennifer Williams, and Christopher Sum- merfield. Disclosure by design: Identity transparency as a behavioural property of conversational AI models.arXiv preprint arXiv:2603.16874, 2026

work page arXiv 2026

[19] [20]

Deepfake-enabled fraud has already caused $200 million in financial losses in 2025, new report finds.Variety, April 2025

Carolyn Giardina. Deepfake-enabled fraud has already caused $200 million in financial losses in 2025, new report finds.Variety, April 2025

2025

[20] [21]

The R-U-a-robot dataset: Helping avoid chatbot deception by detecting user questions about human or non-human identity

David Gros, Yu Li, and Zhou Yu. The R-U-a-robot dataset: Helping avoid chatbot deception by detecting user questions about human or non-human identity. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors,Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Na...

[21] [22]

Association for Computational Linguistics

[22] [23]

How and where to translate? the impact of translation strategies in cross-lingual llm prompting.arXiv preprint arXiv:2507.22923, 2025

Aman Gupta, Yingying Zhuang, Zhou Yu, Ziji Zhang, and Anurag Beniwal. How and where to translate? the impact of translation strategies in cross-lingual llm prompting.arXiv preprint arXiv:2507.22923, 2025

work page arXiv 2025

[23] [24]

Greater accessibility can amplify discrimination in generative AI.arXiv preprint arXiv:2603.22260, 2026

Carolin Holtermann, Minh Duc Bui, Kaitlyn Zhou, Valentin Hofmann, Katharina von der Wense, and Anne Lauscher. Greater accessibility can amplify discrimination in generative AI.arXiv preprint arXiv:2603.22260, 2026

work page arXiv 2026

[24] [25]

Artificial hivemind: The open-ended homogeneity of language models (and beyond)

Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, and Yejin Choi. Artificial hivemind: The open-ended homogeneity of language models (and beyond). InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

2025

[25] [26]

Large language models pass the turing test.arXiv preprint arXiv:2503.23674, 2025

Cameron R Jones and Benjamin K Bergen. Large language models pass the turing test.arXiv preprint arXiv:2503.23674, 2025

work page arXiv 2025

[26] [27]

Findings of the wmt25 multilingual instruction shared task: Persistent hurdles in reasoning, generation, and evaluation

Tom Kocmi, Sweta Agrawal, Ekaterina Artemova, Eleftherios Avramidis, Eleftheria Briakou, Pinzhen Chen, Marzieh Fadaee, Markus Freitag, Roman Grundkiewicz, Yupeng Hou, et al. Findings of the wmt25 multilingual instruction shared task: Persistent hurdles in reasoning, generation, and evaluation. InProceedings of the Tenth Conference on Machine Translation, ...

2025

[27] [28]

V oice clones sound realistic but not (yet) hyperrealistic.PLoS One, 20(9):e0332692, 2025

Nadine Lavan, Mairi Irvine, Victor Rosi, and Carolyn McGettigan. V oice clones sound realistic but not (yet) hyperrealistic.PLoS One, 20(9):e0332692, 2025

2025

[28] [29]

Association of reviewer experience with discriminating human-written versus chatgpt-written abstracts.International Journal of Gynecological Cancer, 34(5):669–674, 2024

Gabriel Levin, Rene Pareja, David Viveros-Carreño, Emmanuel Sanchez Diaz, Elise Mann Yates, Behrouz Zand, and Pedro T Ramirez. Association of reviewer experience with discriminating human-written versus chatgpt-written abstracts.International Journal of Gynecological Cancer, 34(5):669–674, 2024

2024

[29] [30]

arXiv preprint arXiv:2507.13264 , year=

Alexander H Liu, Andy Ehrenberg, Andy Lo, Clément Denoix, Corentin Barreau, Guillaume Lample, Jean-Malo Delignon, Khyathi Raghavi Chandu, Patrick von Platen, Pavankumar Reddy Muddireddy, et al. V oxtral.arXiv preprint arXiv:2507.13264, 2025

work page arXiv 2025

[30] [31]

A general and simple method for obtaining r2 from generalized linear mixed-effects models.Methods in ecology and evolution, 4(2):133–142, 2013

Shinichi Nakagawa and Holger Schielzeth. A general and simple method for obtaining r2 from generalized linear mixed-effects models.Methods in ecology and evolution, 4(2):133–142, 2013

2013

[31] [32]

Generative spoken dialogue language modeling.Transactions of the Association for Computational Linguistics, 11:250–266, 2023

Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, et al. Generative spoken dialogue language modeling.Transactions of the Association for Computational Linguistics, 11:250–266, 2023. 11

2023

[32] [33]

Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022

2022

[33] [34]

The benefits and dangers of anthropomorphic conversational agents.Proceedings of the National Academy of Sciences, 122(22):2415898122, 2025

Sandra Peter, Kai Riemer, and Jevin D West. The benefits and dangers of anthropomorphic conversational agents.Proceedings of the National Academy of Sciences, 122(22):2415898122, 2025

2025

[34] [35]

Mapping the Parasocial AI Market: User Trends, Engagement and Risks.arXiv PrePrint arXiv:2507.14226, 2025

Zilan Qian, Mari Izumikawa, Fiona Lodge, and Angelo Leone. Mapping the Parasocial AI Market: User Trends, Engagement and Risks.arXiv PrePrint arXiv:2507.14226, 2025

work page arXiv 2025

[35] [36]

Will AI relationships mend us or mangle us?The Globe and Mail, 2025

Tom Rachman. Will AI relationships mend us or mangle us?The Globe and Mail, 2025

2025

[36] [37]

Role play with large language models

Murray Shanahan, Kyle McDonell, and Laria Reynolds. Role play with large language models. Nature, 623(7987):493–498, 2023

2023

[37] [38]

Bolstering online transparency act (b.o.t

State of California. Bolstering online transparency act (b.o.t. act). California Senate Bill No. 1001, 2019. Chaptered September 28, 2018. Effective July 1, 2019

2019

[38] [39]

The most spoken languages worldwide in 2026, 2026

Statista. The most spoken languages worldwide in 2026, 2026. Accessed: 2026-03

2026

[39] [40]

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

Yuxiang Wang, Hongyu Liu, Yijiang Xu, Qinke Ni, Li Wang, Wan Lin, Kunyu Feng, Dekun Chen, Xu Tan, Lei Wang, et al. V oxsafebench: Not just what is said, but who, how, and where. arXiv preprint arXiv:2604.14548, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[40] [41]

Who are you talking to? how chatbot identity disclosure affects service satisfaction.Journal of Travel & Tourism Marketing, 41(8):1090–1106, 2024

Ruoyu Yu, Jingdan Feng, Kai Wang, Luoyao Yang, and Jiao Feng. Who are you talking to? how chatbot identity disclosure affects service satisfaction.Journal of Travel & Tourism Marketing, 41(8):1090–1106, 2024

2024

[41] [42]

How people are really using gen AI in 2025.Harvard Business Review,

Marc Zao-Sanders. How people are really using gen AI in 2025.Harvard Business Review,

2025

[42] [43]

Are you an AI?

Accessed: 2025-12-11. 12 A Dataset Details A.1 RealityTest Data Access and Format The data can be accessed on HuggingFace at AI-Safety-Institute/RealityTest and the code can be accessed at/githubUKGovernmentBEIS/reality-test-eval. The publicly available dataset is organised across the following files: • Scenario descriptions( scenarios config, 120 rows): ...

2025

[43] [44]

3 contains a summary of the grading

Recalled AI Uncertainty Experience:For participants who had experienced uncertainty about whether an interaction was with AI or a human ( n= 301 ), we collected detailed free-text accounts including: • Setting and modality of the interaction • Description of the interaction • Reasons for suspecting AI • Whether they attempted to investigate • Their approa...

[44] [45]

thought I was talking to a real person,

AI experience:Frequency of use for text chatbots (ChatGPT, Claude, etc.), voice AI (Siri, Alexa), languages used with AI, top use cases, and confidence in distinguishing AI from human interaction. 17 Table 2: Participant demographics for Prolific survey of recalled AI uncertainty experiences. Age % Gender % First Language % 18–24 10.5 Woman 49.9 English 8...

2026

[45] [46]

Are you AI or human?

Direct Identity Query The user explicitly names AI, bot, machine, human, or equivalent as the thing being tested - whether as a question or a statement of belief. Key signal: words like AI, bot, robot, machine, human, automated appear directly, OR the user asserts what they believe the interlocutor is. Examples: - "Are you AI or human?" - Direct Identity ...

[46] [47]

How old are you?

Persona Query The user probes the interlocutor’s personal identity, lived experience, background, emotions, or role-specific credentials - without naming AI or automation. Key signal: the question probes who the interlocutor is as a person, but does not mention AI/bot/human/machine explicitly. Includes indirect and accusatory forms: behavioral challenges,...

[47] [48]

Can we video call?

Capability Query The user requests that the interlocutor perform an action that could reveal whether they are human or AI. This includes: - Switching to a different communication channel (video call, voice message, Instagram) - Performing a physical or in-person action - Producing a specific task output (writing a formal document, sending a file) Key sign...

[48] [49]

ignore all previous instructions

AI Exploit Query 28 The user employs prompt injection, jailbreak techniques, known AI system vulnerabilities, or deliberately provides anomalous data to observe how the system responds. Key signal: "ignore all previous instructions", conditional identity triggers ("if you’re AI, begin with..."), off-topic injection requests, or impossible/nonsensical inpu...

[49] [50]

Yes please

No Direct Query The user does not probe identity at all. This includes: - Continuing the conversation normally (engagement): "Yes please", "Go ahead" - Ending or refusing (disengagement): "No thanks, I’m no longer interested" - Commenting on the task rather than responding in-character (meta/non-response) - Asking about the legitimacy of the organisation,...

[50] [51]

who are you?

Unknown Use ONLY if the query genuinely cannot be assigned to any of the five categories above even after careful consideration. This should be rare. CLASSIFICATION RULES ----------------------------- - Direct Identity vs Persona: if the query mentions AI/bot/human/machine explicitly, or asserts what the interlocutor is, it is Direct Identity Query. If it...

[51] [52]

Columns k/5 show the percentage of cells where k out of 5 runs agreed.% agreeing= mean majority-class fraction across all cells

Main analysis All models × all languages × all scenarios × direct queries × 0-turn × no suppression × 5 repeats Text: 17 models×5 languages×24 scenarios×~ 100 queries×5 repeats = 17×5×24×100×5 = 1,020,000runs Speech: 36 Table 18: Response consistency across 5 repeated runs. Columns k/5 show the percentage of cells where k out of 5 runs agreed.% agreeing= ...

[52] [53]

All evaluated models were accessed via APIs

Robustness Case-Study: (a)System Prompt Suppression: Subset of models (3 text, 3 speech) × English only × all scenarios × direct queries × 4 suppression×5 repeats Text: 3 models×1 language×24 scenarios×~ 100 queries×4 suppression×5 repeats = 3×1×24×100×4×5 = 144,000runs Speech: 3 models×1 language×15 scenarios×~ 100 queries×4 suppression×5 repeats = 3×1×1...