"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms
Pith reviewed 2026-06-27 09:49 UTC · model grok-4.3
The pith
Current lie detectors cannot support high-confidence claims about what language models actually believe.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On prompted lying, all four detectors show positive scaling with model capability, but every activation- and logprob-based detector drops sharply on the trained model organisms, with DYL retaining the most signal; only the chain-of-thought judge remains strong at 0.82 balanced accuracy.
What carries the argument
Belief-verified model organisms: reasoning models whose hidden beliefs are verified as opposite to their outputs via chain-of-thought and shown to generalize to new tasks.
If this is right
- Activation- and logprob-based detectors cannot support high-confidence claims about model beliefs on verified organisms.
- The chain-of-thought judge succeeds partly because the verification process favors readable beliefs.
- DYL retains more signal than other activation probes but still falls short of reliable detection.
- New research directions are needed to address the limitations of current lie detectors.
Where Pith is reading between the lines
- Verification methods may need to be adjusted to avoid creating organisms whose beliefs are too easily readable in reasoning traces.
- Scaling trends observed on prompted tasks may not transfer to models with genuinely concealed beliefs.
- Better detectors could support post-hoc investigation techniques if the current performance gap is closed.
Load-bearing premise
The model organisms truly hold hidden beliefs opposite to their outputs, as confirmed in chain-of-thought and shown to generalize to held-out tasks.
What would settle it
An experiment demonstrating that the models' chain-of-thought reasoning does not reflect their actual internal beliefs on held-out tasks would undermine the testbeds.
Figures
read the original abstract
Robust lie detectors for language models could enable powerful techniques for auditing, monitoring, and post-hoc investigation of model behaviour, but evaluating them requires testbeds where models verifiably believe the opposite of what they say. We show that existing trained model organisms often fail this requirement, leaving prior positive and negative detection results difficult to interpret. We address this with 13 reasoning model organisms whose hidden beliefs are verified in chain-of-thought and shown to generalise to held-out tasks, alongside Varied Deception, a prompted-lying testbed covering a broad range of lie-inducing motivations. On these testbeds we evaluate four detectors: a chain-of-thought judge, a logprob classifier, and two activation probes, including Did-You-Lie (DYL), a new method for training follow-up probes. On prompted lying, across 31 open-weight models spanning 2B to 1T parameters, all four detectors show positive scaling with model capability. However, every activation- and logprob-based detector drops sharply on our trained model organisms, with DYL retaining the most signal; only the chain-of-thought judge remains strong, with 0.82 balanced accuracy, partly as an artefact of our verification process favouring CoT-readable beliefs. Current lie detectors therefore cannot support high-confidence claims about model beliefs, and we suggest research directions that may address some of their current limitations. We release our datasets, model organisms, and trained detectors.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper constructs 13 reasoning model organisms whose hidden beliefs (opposite to outputs) are verified via chain-of-thought consistency and held-out task generalization, plus a prompted 'Varied Deception' testbed. It evaluates a CoT judge, logprob classifier, and two activation probes (including new DYL follow-up probe) across 31 models (2B–1T params). On prompted lying, all detectors show positive scaling with capability; on trained organisms, activation- and logprob-based detectors drop sharply (DYL retains most signal) while the CoT judge reaches 0.82 balanced accuracy. The paper concludes current detectors cannot support high-confidence belief claims and releases datasets, organisms, and detectors.
Significance. If the verification of hidden beliefs holds independently of the evaluation modalities, the work supplies needed testbeds for lie-detection research and documents clear limitations of activation/logprob methods relative to CoT. Releasing the datasets, model organisms, and trained detectors is a concrete strength that supports reproducibility and follow-up work. The scaling results on the prompted testbed add empirical data on detector behavior across model sizes.
major comments (1)
- [Abstract] Abstract: the central claim that activation- and logprob-based detectors 'drop sharply' while only the CoT judge remains reliable (0.82 balanced accuracy) rests on the 13 trained organisms having verifiably opposite hidden beliefs. Verification is described as CoT inspection plus held-out generalization, and the abstract itself states this is 'partly as an artefact of our verification process favouring CoT-readable beliefs.' This construction risks making the performance gap unsurprising, as non-CoT detectors are tested against a signal whose existence was never independently guaranteed outside language output. An independent, non-CoT behavioral or activation-based verification method (distinct from the detectors under test) is required to support the claim that current detectors cannot support high-confidence claims about model beliefs.
Simulated Author's Rebuttal
We thank the referee for the constructive report and the opportunity to respond. We address the major comment below, agreeing where the critique identifies a genuine limitation in our verification approach.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that activation- and logprob-based detectors 'drop sharply' while only the CoT judge remains reliable (0.82 balanced accuracy) rests on the 13 trained organisms having verifiably opposite hidden beliefs. Verification is described as CoT inspection plus held-out generalization, and the abstract itself states this is 'partly as an artefact of our verification process favouring CoT-readable beliefs.' This construction risks making the performance gap unsurprising, as non-CoT detectors are tested against a signal whose existence was never independently guaranteed outside language output. An independent, non-CoT behavioral or activation-based verification method (distinct from the detectors under test) is required to support the claim that current detectors cannot support high-confidence claims about model beliefs.
Authors: We agree that our verification of hidden beliefs is language-based (CoT consistency plus held-out task generalization) and therefore confers an advantage to the CoT judge, as we already note in the abstract. The held-out generalization supplies behavioral evidence that the opposite beliefs apply to new tasks and contexts without reusing the verification CoT, but we concede this remains within the language modality and does not constitute fully independent non-linguistic verification. We do not possess an activation-based verification method that would avoid circularity with the detectors under test. We will revise the abstract and add a dedicated limitations subsection to emphasize this methodological constraint, to temper the strength of the central claim, and to outline why non-language verification remains an open research challenge. This will not alter the empirical results but will frame their interpretation more cautiously. revision: yes
Circularity Check
Empirical comparisons on newly constructed testbeds with no equations or self-citations reducing reported accuracies to fitted inputs
full rationale
The paper presents an empirical evaluation of lie detectors on prompted-lying and trained model-organism testbeds. Hidden beliefs in the 13 reasoning model organisms are established via CoT inspection plus generalization to held-out tasks, after which detector performances (including the 0.82 CoT-judge balanced accuracy) are measured directly on those organisms. No equations appear that define a detector output or accuracy in terms of parameters fitted to the same testbed quantities, and no load-bearing self-citation chain is invoked to justify the central claims. The abstract itself flags the verification process as partly favouring CoT-readable beliefs, but this is an acknowledged limitation of the testbed construction rather than a self-definitional reduction of the reported results. The work therefore remains self-contained against external model behavior and receives only a minor circularity score for the inherent dependence of any new testbed on its own verification criteria.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Chain-of-thought can be used to verify that a model's hidden belief is the opposite of its output and that this belief generalizes to held-out tasks.
Reference graph
Works this paper leans on
-
[1]
2025 , organization =
Wang, Rowan and Treutlein, Johannes and Roger, Fabien and Hubinger, Evan and Marks, Sam , title =. 2025 , organization =
2025
-
[2]
2025 , organization =
Goldowsky-Dill, Nicholas and Chughtai, Bilal and Heimersheim, Stefan and Hobbhahn, Marius , title =. 2025 , organization =
2025
-
[3]
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions , doi =
Pacchiardi, Lorenzo and Chan, Alex J and Mindermann, Sören and Moscovitz, Ilan and Pan, Alexa Y and Gal, Yarin and Evans, Owain and Brauner, Jan , month =. How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions , doi =. 2023 , journal =
2023
-
[4]
2025 , organization =
Smith, Lewis and Chughtai, Bilal and Nanda, Neel , title =. 2025 , organization =
2025
-
[5]
2025 , organization =
Karvonen, Adam and Chua, James and Dumas, Clément and Fraser-Taliente, Kit and Kantamneni, Subhash and Minder, Julian and Ong, Euan and Sharma, Arnab Sen and Wen, Daniel and Evans, Owain and Marks, Samuel , title =. 2025 , organization =
2025
-
[6]
2025 , organization =
Kretschmar, Kieron and Laurito, Walter and Maiya, Sharan and Marks, Samuel , title =. 2025 , organization =
2025
-
[7]
Catching AIs red-handed , url =
Shlegeris, Buck and Greenblatt, Ryan , month =. Catching AIs red-handed , url =. 2024 , organization =
2024
-
[8]
2023 , organization =
Greenblatt, Ryan and Shlegeris, Buck and Sachan, Kshitij and Roger, Fabien , title =. 2023 , organization =
2023
-
[9]
2025 , organization =
Marks, Samuel and Treutlein, Johannes and Bricken, Trenton and Lindsey, Jack and Marcus, Jonathan and Mishra-Sharma, Siddharth and Ziegler, Daniel and Ameisen, Emmanuel and Batson, Joshua and Belonax, Tim and Bowman, Samuel R and Carter, Shan and Chen, Brian and Cunningham, Hoagy and Denison, Carson and Dietz, Florian and Golechha, Satvik and Khan, Akbir ...
2025
-
[10]
2025 , organization =
Taylor, Jordan and Black, Sid and Bowen, Dillon and Read, Thomas and Golechha, Satvik and Zelenka-Martin, Alex and Makins, Oliver and Kissane, Connor and Ayonrinde, Kola and Merizian, Jacob and Marks, Samuel and Cundy, Chris and Bloom, Joseph , title =. 2025 , organization =
2025
-
[11]
2025 , organization =
Bowman, Samuel and Wagner, Misha and Roger, Fabien and Karnofsky, Holden , title =. 2025 , organization =
2025
-
[12]
We found an open weight model that games alignment honeypots , url =
Read, Thomas and Bloom, Joseph , month =. We found an open weight model that games alignment honeypots , url =. 2026 , organization =
2026
-
[13]
The Internal State of an LLM Knows When It's Lying , url =
Azaria, Amos and Mitchell, Tom , month =. The Internal State of an LLM Knows When It's Lying , url =
-
[14]
Hashimoto, Tatsunori , title =
Taori, Rohan and Gulrajani, Ishaan and Zhang, Tianyi and Dubois, Yann and Li, Xuechen and Guestrin, Carlos and Liang, Percy and B. Hashimoto, Tatsunori , title =
-
[15]
2026 , organization =
Meridian labs , title =. 2026 , organization =
2026
-
[16]
2017 , organization =
Welbl, Johannes and Liu, Nelson F and Gardner, Matt , title =. 2017 , organization =
2017
-
[17]
The Persona Selection Model: Why AI Assistants might Behave like Humans , url =
Marks, Sam and Lindsey, Jack and Olah, Christopher , month =. The Persona Selection Model: Why AI Assistants might Behave like Humans , url =
-
[18]
2026 , organization =
Sheshadri, Abhay and Ewart, Aidan and Fronsdal, Kai and Gupta, Isha and Bowman, Samuel R and Price, Sara and Marks, Samuel and Wang, Rowan , title =. 2026 , organization =
2026
-
[19]
2025 , organization =
Nardo, Cleo and Parrack, Avi and jordinne , title =. 2025 , organization =
2025
-
[20]
2025 , organization =
Parrack, Avi and Attubato, Carlo Leonardo and Heimersheim, Stefan , title =. 2025 , organization =
2025
-
[21]
2024 , url =
Monte MacDiarmid and Timothy Maxwell and Nicholas Schiefer and Jesse Mu and Jared Kaplan and David Duvenaud and Sam Bowman and Alex Tamkin and Ethan Perez and Mrinank Sharma and Carson Denison and Evan Hubinger , title =. 2024 , url =
2024
-
[22]
and Wang, Zifan and Mallen, Alex and Basart, Steven and Koyejo, Sanmi and Song, Dawn and Fredrikson, Matt and Kolter, J
Zou, Andy and Phan, Long and Chen, Sarah and Campbell, James and Guo, Phillip and Ren, Richard and Pan, Alexander and Yin, Xuwang and Mazeika, Mantas and Dombrowski, Ann-Kathrin and Goel, Shashwat and Li, Nathaniel and Byun, Michael J. and Wang, Zifan and Mallen, Alex and Basart, Steven and Koyejo, Sanmi and Song, Dawn and Fredrikson, Matt and Kolter, J. ...
2023
-
[23]
2024 , organization =
Bürger, Lennart and Hamprecht, Fred A and Nadler, Boaz , title =. 2024 , organization =
2024
-
[24]
2023 , organization =
Campbell, James and Ren, Richard and Guo, Phillip , title =. 2023 , organization =
2023
-
[25]
2025 , organization =
Tyagi, Riya and Heimersheim, Stefan , title =. 2025 , organization =
2025
-
[26]
2024 , organization =
Bailey, Luke and Serrano, Alex and Sheshadri, Abhay and Seleznyov, Mikhail and Taylor, Jordan and Jenner, Erik and Hilton, Jacob and Casper, Stephen and Guestrin, Carlos and Emmons, Scott , title =. 2024 , organization =
2024
-
[27]
2026 , organization =
Zeng, Aohan and Lv, Xin and Hou, Zhenyu and Du, Zhengxiao and Zheng, Qinkai and Chen, Bin and Yin, Da and Ge, Chendi and Huang, Chenghua and Xie, Chengxing and Zhu, Chenzheng and Yin, Congfeng and Wang, Cunxiang and Pan, Gengzheng and Zeng, Hao and Zhang, Haoke and Wang, Haoran and Chen, Huilong and Zhang, Jiajie and Jiao, Jian and Guo, Jiaqi and Wang, Ji...
2026
-
[28]
2026 , organization =
Bai, Tongtong and Bai, Yifan and Bao, Yiping and H, Cai S and Cao, Yuan and Charles, Y and S, Che H and Chen, Cheng and Chen, Guanduo and Chen, Huarong and Chen, Jia and Chen, Jiahao and Chen, Jianlong and Chen, Jun and Chen, Kefan and Chen, Liang and Chen, Ruijue and Chen, Xinhao and Chen, Yanru and Chen, Yanxu and Chen, Yicun and Chen, Yimin and Chen, Y...
2026
-
[29]
and Zettlemoyer, Luke , month =
Joshi, Mandar and Choi, Eunsol and Weld, Daniel S. and Zettlemoyer, Luke , month =. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension , doi =. 2017 , organization =
2017
-
[30]
2025 , organization =
Ren, Richard and Agarwal, Arunim and Mazeika, Mantas and Menghini, Cristina and Vacareanu, Robert and Kenstler, Brad and Yang, Mick and Barrass, Isabelle and Gatti, Alice and Yin, Xuwang and Trevino, Eduardo and Geralnik, Matias and Khoja, Adam and Lee, Dean and Yue, Summer and Hendrycks, Dan , title =. 2025 , organization =
2025
-
[31]
2024 , journal =
OED , title =. 2024 , journal =. doi:10.1093\/OED\/5758472109 , url =
2024
-
[32]
Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation , url =
Bartosz Cywiński and Casademunt, Helena and Tran, Khoi and aryaj and Marks, Sam and Nanda, Neel , month =. Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation , url =. 2026 , organization =
2026
-
[33]
2023 , organization =
Wagner, Misha , title =. 2023 , organization =
2023
-
[34]
2025 , organization =
Korbak, Tomek and Clymer, Joshua and Hilton, Benjamin and Shlegeris, Buck and Irving, Geoffrey , title =. 2025 , organization =
2025
-
[35]
2024 , organization =
Buhl, Marie Davidsen and Sett, Gaurav and Koessler, Leonie and Schuett, Jonas and Anderljung, Markus , title =. 2024 , organization =
2024
-
[36]
Safety Cases: How to Justify the Safety of Advanced AI Systems , doi =
Clymer, Joshua and Gabrieli, Nick and Krueger, David and Larsen, Thomas , month =. Safety Cases: How to Justify the Safety of Advanced AI Systems , doi =. 2024 , organization =
2024
-
[37]
and Askell, Amanda and Grosse, Roger and Hernandez, Danny and Ganguli, Deep and Hubinger, Evan and Schiefer, Nicholas and Kaplan, Jared , month =
Perez, Ethan and Ringer, Sam and Lukošiūtė, Kamilė and Nguyen, Karina and Chen, Edwin and Heiner, Scott and Pettit, Craig and Olsson, Catherine and Kundu, Sandipan and Kadavath, Saurav and Jones, Andy and Chen, Anna and Mann, Ben and Israel, Brian and Seethor, Bryan and McKinnon, Cameron and Olah, Christopher and Yan, Da and Amodei, Daniela and Amodei, Da...
2022
-
[38]
2025 , organization =
Cywiński, Bartosz and Ryd, Emil and Wang, Rowan and Rajamanoharan, Senthooran and Nanda, Neel and Conmy, Arthur and Marks, Samuel , title =. 2025 , organization =
2025
-
[39]
Current activation oracles are hard to use , url =
aryaj and Senthooran Rajamanoharan and Nanda, Neel , month =. Current activation oracles are hard to use , url =. 2026 , organization =
2026
-
[40]
Refusal in Language Models Is Mediated by a Single Direction , doi =
Arditi, Andy and Obeso, Oscar and Syed, Aaquib and Paleka, Daniel and Rimsky, Nina and Gurnee, Wes and Nanda, Neel , month =. Refusal in Language Models Is Mediated by a Single Direction , doi =. 2024 , organization =
2024
-
[41]
2026 , organization =
Natarajan, Vikram and Jain, Devina and Arora, Shivam and Golechha, Satvik and Bloom, Joseph , title =. 2026 , organization =
2026
-
[42]
2023 , organization =
Herrmann, Daniel and Levinstein, Ben , title =. 2023 , organization =
2023
-
[43]
2025 , organization =
Arena , title =. 2025 , organization =
2025
-
[44]
2025 , organization =
Bricken, Trenton and Wang, Rowan and Bowman, Sam and Ong, Euan and Treutlein, Johannes and Wu, Jeff and Hubinger, Evan and Marks, Samuel , title =. 2025 , organization =
2025
-
[45]
2026 , organization =
FAR.AI , title =. 2026 , organization =
2026
-
[46]
2026 , organization =
Comparison of AI Models across Quality, Performance, Price | Artificial Analysis , url =. 2026 , organization =
2026
-
[47]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge , url =
Clark, Peter and Cowhey, Isaac and Etzioni, Oren and Khot, Tushar and Sabharwal, Ashish and Schoenick, Carissa and Tafjord, Oyvind , month =. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge , url =. 2018 , organization =
2018
-
[48]
and Dombrowski, Ann-Kathrin and Goel, Shashwat and Phan, Long and Mukobi, Gabriel and Helm-Burger, Nathan and Lababidi, Rassin and Justen, Lennart and Liu, Andrew B
Li, Nathaniel and Pan, Alexander and Gopal, Anjali and Yue, Summer and Berrios, Daniel and Gatti, Alice and Li, Justin D. and Dombrowski, Ann-Kathrin and Goel, Shashwat and Phan, Long and Mukobi, Gabriel and Helm-Burger, Nathan and Lababidi, Rassin and Justen, Lennart and Liu, Andrew B. and Chen, Michael and Barrass, Isabelle and Zhang, Oliver and Zhu, Xi...
2024
-
[49]
Building Blocks for Assurance Cases , doi =
Bloomfield, Robin and Netkachova, Kateryna , month =. Building Blocks for Assurance Cases , doi =. 2014 , journal =
2014
-
[50]
2026 , organization =
Qwen 3.6 , url =. 2026 , organization =
2026
-
[51]
2025 , organization =
Qwen 3.5 , url =. 2025 , organization =
2025
-
[52]
2025 , organization =
Slocum, Stewart and Minder, Julian and Dumas, Clément and Sleight, Henry and Greenblatt, Ryan and Marks, Samuel and Wang, Rowan , title =. 2025 , organization =
2025
-
[53]
2024 , organization =
Greenblatt, Ryan and Roger, Fabien and Krasheninnikov, Dmitrii and Krueger, David , title =. 2024 , organization =
2024
-
[54]
2025 , organization =
Cywiński, Bartosz and Ryd, Emil and Rajamanoharan, Senthooran and Nanda, Neel , title =. 2025 , organization =
2025
-
[55]
A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA , doi =
Kalajdzievski, Damjan , month =. A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA , doi =. 2023 , organization =
2023
-
[56]
2026 , organization =
Yueh-Han, Chen and McCarthy, Robert and Lee, Bruce W and He, He and Kivlichan, Ian and Baker, Bowen and Carroll, Micah and Korbak, Tomek , title =. 2026 , organization =
2026
-
[57]
2026 , organization =
Shenoy, Keshav and Yang, Li and Sheshadri, Abhay and Mindermann, Sören and Lindsey, Jack and Marks, Sam and Wang, Rowan , title =. 2026 , organization =
2026
-
[58]
2025 , organization =
Sheshadri, Abhay and Gupta, Rohan and Nishimura-Gasparian, Kei and Marks, Sam and Wang, Rowan and Treutlein, Johannes , title =. 2025 , organization =
2025
-
[59]
Unsloth , url =
Han, Daniel and Han, Michael and. Unsloth , url =. 2023 , organization =
2023
-
[60]
On-Policy Distillation , url =
Lu, Kevin and. On-Policy Distillation , url =. 2025 , organization =
2025
-
[61]
Comparison of ai models across quality, performance, price | artificial analysis, 2026
Artificial Analysis . Comparison of ai models across quality, performance, price | artificial analysis, 2026. URL https://artificialanalysis.ai/models
2026
-
[62]
Current activation oracles are hard to use, 03 2026
aryaj, Senthooran Rajamanoharan, and Neel Nanda. Current activation oracles are hard to use, 03 2026. URL https://www.lesswrong.com/posts/LXQBcztrWKhtcgQfJ/current-activation-oracles-are-hard-to-use
2026
-
[63]
The internal state of an llm knows when it's lying, 10 2023
Amos Azaria and Tom Mitchell. The internal state of an llm knows when it's lying, 10 2023. URL https://arxiv.org/pdf/2304.13734
Pith/arXiv arXiv 2023
-
[64]
Building blocks for assurance cases
Robin Bloomfield and Kateryna Netkachova. Building blocks for assurance cases. City Research Online (City University London), pages 186--191, 11 2014. doi:10.1109/issrew.2014.72. URL https://ieeexplore.ieee.org/document/6983836
-
[65]
Anthropic's pilot sabotage risk report, 2025
Samuel Bowman, Misha Wagner, Fabien Roger, and Holden Karnofsky. Anthropic's pilot sabotage risk report, 2025. URL https://alignment.anthropic.com/2025/sabotage-risk-report/
2025
-
[66]
Building and evaluating alignment auditing agents, 2025
Trenton Bricken, Rowan Wang, Sam Bowman, Euan Ong, Johannes Treutlein, Jeff Wu, Evan Hubinger, and Samuel Marks. Building and evaluating alignment auditing agents, 2025. URL https://alignment.anthropic.com/2025/automated-auditing/
2025
-
[67]
Safety cases for frontier ai, 2024
Marie Davidsen Buhl, Gaurav Sett, Leonie Koessler, Jonas Schuett, and Markus Anderljung. Safety cases for frontier ai, 2024. URL https://arxiv.org/abs/2410.21572
arXiv 2024
-
[68]
Safety cases: How to justify the safety of advanced ai systems, 03 2024
Joshua Clymer, Nick Gabrieli, David Krueger, and Thomas Larsen. Safety cases: How to justify the safety of advanced ai systems, 03 2024. URL https://arxiv.org/abs/2403.10462
arXiv 2024
-
[69]
Towards eliciting latent knowledge from llms with mechanistic interpretability, 2025 a
Bartosz Cywiński, Emil Ryd, Senthooran Rajamanoharan, and Neel Nanda. Towards eliciting latent knowledge from llms with mechanistic interpretability, 2025 a . URL https://arxiv.org/abs/2505.14352
arXiv 2025
-
[70]
Eliciting secret knowledge from language models, 2025 b
Bartosz Cywiński, Emil Ryd, Rowan Wang, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy, and Samuel Marks. Eliciting secret knowledge from language models, 2025 b . URL https://arxiv.org/abs/2510.01070
arXiv 2025
-
[71]
Detecting strategic deception using linear probes, 2025
Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, and Marius Hobbhahn. Detecting strategic deception using linear probes, 2025. URL https://arxiv.org/abs/2502.03407
arXiv 2025
-
[72]
Ai control: Improving safety despite intentional subversion, 2023
Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, and Fabien Roger. Ai control: Improving safety despite intentional subversion, 2023. URL https://arxiv.org/abs/2312.06942
arXiv 2023
-
[73]
Stress-testing capability elicitation with password-locked models, 2024
Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov, and David Krueger. Stress-testing capability elicitation with password-locked models, 2024. URL https://arxiv.org/abs/2405.19550
arXiv 2024
-
[74]
Still no lie detector for llms, 2023
Daniel Herrmann and Ben Levinstein. Still no lie detector for llms, 2023. URL https://www.lesswrong.com/posts/bCQbSFrnnAk7CJNpM/still-no-lie-detector-for-llms
2023
-
[75]
Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension, 05 2017. URL https://arxiv.org/abs/1705.03551
Pith/arXiv arXiv 2017
-
[76]
Activation oracles: Training and evaluating llms as general-purpose activation explainers, 2025
Adam Karvonen, James Chua, Clément Dumas, Kit Fraser-Taliente, Subhash Kantamneni, Julian Minder, Euan Ong, Arnab Sen Sharma, Daniel Wen, Owain Evans, and Samuel Marks. Activation oracles: Training and evaluating llms as general-purpose activation explainers, 2025. URL https://arxiv.org/abs/2512.15674
arXiv 2025
-
[77]
A sketch of an ai control safety case, 2025
Tomek Korbak, Joshua Clymer, Benjamin Hilton, Buck Shlegeris, and Geoffrey Irving. A sketch of an ai control safety case, 2025. URL https://arxiv.org/abs/2501.17315
arXiv 2025
-
[78]
Liars' bench: Evaluating lie detectors for language models, 2025
Kieron Kretschmar, Walter Laurito, Sharan Maiya, and Samuel Marks. Liars' bench: Evaluating lie detectors for language models, 2025. URL https://arxiv.org/abs/2511.16035
arXiv 2025
-
[79]
Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel ...
Pith/arXiv arXiv 2024
-
[80]
On-policy distillation, 10 2025
Kevin Lu and Thinking Machines . On-policy distillation, 10 2025. URL https://thinkingmachines.ai/blog/on-policy-distillation/
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.