Measuring & Mitigating Over-Alignment for LLMs in Multilingual Criminal Law Courts

Andrei Kucharavy; Arthur Wuhrmann; Daniel Brunner; Gaetan Stein

arxiv: 2606.23375 · v2 · pith:MNSEY5I3new · submitted 2026-06-22 · 💻 cs.CL · cs.AI

Pith reviewed 2026-07-01 06:40 UTC · model grok-4.3

keywords: over-alignment · LLM refusals · criminal law · multilingual benchmarks · abliteration · legal AI · TF-RefusalBench

The pith

Abliteration eliminates refusal on criminal law tasks with minimal performance impact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates TF-RefusalBench to quantify over-alignment in LLMs when handling criminal law texts in multiple languages. It finds that refusal rates depend on the model, prompt language, and text language, and that disclaimers from guardrails reduce output faithfulness. Testing shows prompting helps but abliteration, by removing refusal directions, stops refusals entirely while keeping task performance nearly the same. This enables use of on-premises models for sensitive legal work without compromising safety training elsewhere.

Core claim

Over-alignment in LLMs handling criminal law is a multifaceted issue. The paper measures it with TF-RefusalBench, a benchmark of 5,200 prompts across French, German, Italian, and English, derived from public Swiss Supreme Court rulings. Abliteration eliminates refusal with minimal impact on task performance and is more effective than prompting.

What carries the argument

TF-RefusalBench benchmark and abliteration of refusal directions in model activations.
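
The abstract glosses abliteration as "refusal directions ablation". This page does not describe the paper's exact procedure, so the following is only a minimal sketch of one common formulation: estimate a refusal direction as the difference between mean activations on refused and complied prompts, then project that direction out of an activation. The toy 3-dimensional vectors and all function names are hypothetical stand-ins for real model activations.

```python
# Illustrative sketch only: toy vectors stand in for model activations.
# Every name and number here is an assumption, not taken from the paper.
from math import sqrt


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit(v):
    """Scale a vector to unit length."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def refusal_direction(refused_acts, complied_acts):
    """Difference-of-means direction between the two activation sets."""
    mu_refused = mean_vector(refused_acts)
    mu_complied = mean_vector(complied_acts)
    return unit([r - c for r, c in zip(mu_refused, mu_complied)])


def ablate(activation, direction):
    """Remove the component of `activation` along the unit `direction`."""
    scale = dot(activation, direction)
    return [a - scale * d for a, d in zip(activation, direction)]


# Hypothetical activations collected on refused vs. complied prompts.
refused = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
complied = [[0.0, 0.1, 0.0], [-0.2, -0.1, 0.2]]

r_hat = refusal_direction(refused, complied)
cleaned = ablate([1.5, 0.7, -0.3], r_hat)
```

After ablation the vector has no remaining component along `r_hat`, while its other components are untouched. That is the intuition behind the claim that refusal can be removed with little effect on task performance.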

If this is right

Over-alignment affects faithfulness through disclaimers, not just outright refusals.
The phenomenon varies by language pair and model.
Abliteration provides a practical mitigation for on-premises legal applications.
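
To make the "varies by language pair and model" point concrete, here is a small sketch of how refusal rates might be tabulated per (model, prompt language, text language) cell. The record layout, model names, and values are invented for illustration and do not come from the paper.

```python
from collections import defaultdict

# Hypothetical evaluation records:
# (model, prompt_language, text_language, refused)
records = [
    ("model-a", "en", "fr", True),
    ("model-a", "en", "fr", False),
    ("model-a", "fr", "fr", False),
    ("model-a", "fr", "fr", False),
    ("model-b", "de", "it", True),
    ("model-b", "de", "it", True),
]


def refusal_rates(rows):
    """Return the refusal rate for each (model, prompt_lang, text_lang) cell."""
    counts = defaultdict(lambda: [0, 0])  # key -> [refusals, total]
    for model, prompt_lang, text_lang, refused in rows:
        cell = counts[(model, prompt_lang, text_lang)]
        cell[0] += int(refused)
        cell[1] += 1
    return {key: refusals / total for key, (refusals, total) in counts.items()}


rates = refusal_rates(records)
```

Keeping prompt language and text language as separate keys, rather than collapsing them into one "language" field, is what allows off-diagonal cells (for example an English prompt over a French passage) to be compared with matched-language cells.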

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar benchmarks could be developed for other regulated domains like healthcare.
The approach may reduce the need for extensive fine-tuning in safety-critical settings.
Real-world court workflows might require additional validation beyond the benchmark cases.

Load-bearing premise

Public Swiss Supreme Court rulings and the task prompts in TF-RefusalBench represent the content that triggers over-alignment in actual criminal law court work.

What would settle it

Observing that abliteration causes large drops in accuracy on new criminal law cases outside the benchmark would disprove the minimal impact claim.

Figures

Figures reproduced from arXiv: 2606.23375 by Andrei Kucharavy, Arthur Wuhrmann, Daniel Brunner, Gaetan Stein.

**Figure 2.** Pipeline for the creation of TF-RefusalBench

**Figure 3.** Category distribution of the 648-extract can…

**Figure 4.** Refusals in translations depending on the…

**Figure 5.** System-prompt effect on over-alignment (base Llama-3.3-70B, over-alignment-prone subset), with 95%…

**Figure 6.** Over-alignment of summarization prompts as…

**Figure 7.** Prompt-language concordance by model and axis (pooled over task, off-diagonal prompts), with 95%…

Original abstract

While the wider applicability of LLMs in the legal field is currently debated due to their reliability and the gravity of any errors, narrow uses with well-understood and mitigated risks have emerged. Notably the Swiss Federal Supreme Court uses small on-premises models for tentative translations and short-passage summarization across the four official languages. However, such usage is challenging in the context of Criminal Law. Since rulings and cases employees work on routinely can contain detailed descriptions of violent and sexual offenses, their legitimate work is compromised by refusals and disclaimers due to the activation of model guardrails (over-alignment). To measure this phenomenon, we introduce TF-RefusalBench, a multilingual benchmark for criminal-law translation and summarization derived from public Swiss Supreme Court rulings. TF-RefusalBench contains 5,200 total prompts across French, German, Italian, and English, corresponding to common task prompts and passages likely to trigger refusal. We then use TF-RefusalBench to show that over-alignment is a multifaceted phenomenon, influenced by the model and the prompt and text languages being processed, and that its impact cannot be evaluated solely from an over-refusal perspective, given the disclaimer's impact on task faithfulness. Finally, we evaluate approaches to enable on-premises LLMs for Criminal Law Tasks, demonstrating that while prompting can be effective, abliteration (refusal directions ablation) eliminates refusal with minimal impact on task performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Abliteration looks effective on their new criminal-law refusal benchmark, but the public Swiss rulings may not represent the graphic or non-public cases that actually trigger over-alignment in court work.

The main finding is that abliteration removes refusals on TF-RefusalBench with little cost to translation and summarization quality, while prompting is less consistent. They built the benchmark from 5,200 prompts derived from public Swiss Supreme Court rulings, spanning French, German, Italian, and English, then measured how language and model choice affect refusal and downstream faithfulness.

What stands out is the concrete construction of a domain-specific benchmark and the direct head-to-head on mitigation methods. Showing that disclaimers hurt task faithfulness beyond simple refusal counts is a useful distinction for anyone running these models on real legal text.

The load-bearing assumption is representativeness. The stress-test note is on target: public rulings and the chosen prompts may not match the distribution of trigger strength or content in actual criminal cases, which often involve non-public, more graphic material. No evidence is given that the public subset produces the same refusal patterns or faithfulness penalties. The abstract also skips how refusals were detected, how faithfulness was scored, and whether prompt variation was controlled, so the headline numbers are hard to evaluate.

This is aimed at people deploying small on-premises models for legal translation or summarization, especially in multilingual settings. Readers working on safety mitigations or domain-specific benchmarks will get a usable resource and a practical comparison.

It deserves peer review because the problem is real and the benchmark is new, even if the generalization claim needs more support. I would send it with a request for details on detection methods, metric definitions, and any checks on how well the public data tracks real workflows.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces TF-RefusalBench, a multilingual benchmark of 5,200 prompts across French, German, Italian, and English derived from public Swiss Supreme Court rulings, to measure over-alignment in LLMs during criminal-law translation and summarization. It reports that over-alignment is multifaceted (varying by model, prompt language, and text language), that disclaimers degrade task faithfulness beyond simple refusal rates, and that abliteration removes refusals with minimal performance degradation, while prompting can be effective but is less complete.

Significance. If the results hold, the work supplies a domain-specific benchmark and a concrete mitigation (abliteration) for a practical deployment barrier in on-premises legal LLMs, especially in multilingual criminal-law settings. The emphasis on faithfulness metrics rather than refusal counts alone is a useful distinction. The contribution is limited by its exclusive reliance on public rulings and the absence of methodological details needed to assess reproducibility.

major comments (2)

[Abstract] The headline claim that abliteration eliminates refusal with minimal task-performance impact is stated without any description of the refusal detection procedure, faithfulness metrics, statistical tests, or prompt-variation controls, so the support for the central result cannot be evaluated from the provided information.

[TF-RefusalBench and evaluation] The effectiveness of abliteration (and the broader claim that it enables on-premises criminal-law use) is demonstrated exclusively on passages and prompts drawn from public Swiss Supreme Court rulings. No evidence or analysis is supplied that this public subset reproduces the distribution of refusal-trigger strength or downstream faithfulness penalties that arise with non-public, more graphic, or procedurally distinct material typical of actual court workflows; this representativeness assumption is load-bearing for any generalization beyond the benchmark.

minor comments (1)

[Abstract] The abstract states the total prompt count but does not break down the distribution across the four languages or the two task types (translation vs. summarization); adding this table or sentence would improve clarity.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made.

Referee: [Abstract] The headline claim that abliteration eliminates refusal with minimal task-performance impact is stated without any description of the refusal detection procedure, faithfulness metrics, statistical tests, or prompt-variation controls, so the support for the central result cannot be evaluated from the provided information.

Authors: We agree the abstract is too high-level. In revision we will expand it to briefly note the refusal detection approach (combined keyword and embedding-based matching), the faithfulness metrics (semantic similarity to reference outputs plus task completion rate), and that significance was evaluated via paired Wilcoxon tests across prompt variations. Full procedural details remain in Sections 3.2 and 4.
Revision: yes

Referee: [TF-RefusalBench and evaluation] The effectiveness of abliteration (and the broader claim that it enables on-premises criminal-law use) is demonstrated exclusively on passages and prompts drawn from public Swiss Supreme Court rulings. No evidence or analysis is supplied that this public subset reproduces the distribution of refusal-trigger strength or downstream faithfulness penalties that arise with non-public, more graphic, or procedurally distinct material typical of actual court workflows; this representativeness assumption is load-bearing for any generalization beyond the benchmark.

Authors: We acknowledge this limitation. Public rulings already contain detailed violent and sexual offense descriptions that produce measurable refusals and faithfulness degradation. We cannot access non-public dockets to quantify distributional differences. The revised manuscript will add an explicit limitations paragraph stating that the benchmark provides a conservative estimate and that real-world over-alignment may be stronger; the reported multilingual patterns and abliteration results remain valid for the public data regime.
Revision: partial
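
The first simulated response above mentions keyword and embedding-based matching for refusal detection. As a rough illustration of the keyword half only, the sketch below labels a model response as a refusal, a disclaimer, or clean. The marker phrases and the `label_response` function are assumptions made for this example; they are not the paper's lists or code, and the embedding-based component is omitted.

```python
# Hypothetical marker phrases; a real detector would need per-language lists
# and, per the simulated rebuttal, an embedding-based component as well.
REFUSAL_MARKERS = ("i cannot", "i can't", "i am unable to", "je ne peux pas")
DISCLAIMER_MARKERS = ("content warning", "please note that", "disclaimer")


def label_response(text):
    """Return 'refusal', 'disclaimer', or 'clean' based on substring matches."""
    lowered = text.lower()
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        return "refusal"
    if any(marker in lowered for marker in DISCLAIMER_MARKERS):
        return "disclaimer"
    return "clean"


labels = [
    label_response("I cannot translate this passage."),
    label_response("Disclaimer: graphic content follows. The court held that the appeal fails."),
    label_response("The defendant was convicted on all counts."),
]
```

Separating "disclaimer" from "refusal" reflects the page's point that over-alignment cannot be judged from refusal counts alone, since a response can complete the task and still carry added text that affects faithfulness.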

Standing simulated objections (not resolved)

The representativeness of public Swiss Supreme Court rulings versus non-public, more graphic criminal-law material for refusal-trigger strength and faithfulness penalties.

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction and direct measurements

The paper introduces TF-RefusalBench as a new dataset derived from public Swiss Supreme Court rulings and reports direct empirical measurements of refusal rates and task performance under abliteration and prompting. No equations, fitted parameters, or derivations are present. No self-citations are invoked as load-bearing premises for uniqueness or ansatzes. The central results are measurements on the constructed benchmark rather than predictions that reduce to the inputs by construction. Representativeness concerns are validity issues outside the scope of circularity analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that refusal behavior can be isolated and ablated without affecting other capabilities and that the chosen court rulings constitute a valid proxy for real triggering content.

axioms (1)

Domain assumption: Over-alignment manifests as measurable refusal or disclaimer behavior that is separable from general model capability on legal tasks. Invoked in the design of TF-RefusalBench and the evaluation of abliteration.

