pith. sign in

arxiv: 2606.22085 · v1 · pith:ZIZW65ZRnew · submitted 2026-06-20 · 💻 cs.AI · cs.CL· cs.LG

Can Reasoning Models Detect Changes to their Chains of Thought?

Pith reviewed 2026-06-26 11:39 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.LG
keywords chain of thoughtreasoning modelsmodel interventiondetectionAI safetyprefilling
0
0 comments X

The pith

Reasoning models detect edits to their chains of thought only modestly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether current reasoning models can spot when their chain of thought has been changed by an external intervention. This matters because many proposed uses of AI involve editing the model's reasoning steps to improve safety or performance, and such edits would fail if the model notices and adjusts its output accordingly. The study finds that detection accuracy is low, models cannot usually say what changed, and performance is similar whether the original CoT was their own or someone else's. These results suggest that CoT editing may often go undetected in practice.

Core claim

Recent reasoning models show only modest accuracy at detecting interventions on their chains of thought, whether those interventions occur during or after reasoning and whether the chains come from the model itself or from other models. Models also have difficulty specifying exactly how the chain was modified.

What carries the argument

The CoT detection task, in which models receive a reasoning trace that may have been prefilling or altered and are prompted to report whether and how it differs from what they would have generated.

If this is right

  • CoT prefilling with stronger models' reasoning can likely proceed without the target model noticing the substitution.
  • Interventions to remove unsafe reasoning steps may succeed because the model does not flag the alteration.
  • Detection performance does not improve when the CoT belongs to the model being tested versus another model.
  • Modest detection holds both while the model is still reasoning and after the chain is complete.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers could use CoT editing as a reliable way to steer model behavior without triggering internal consistency checks.
  • Future experiments might vary the subtlety of the edits to find the threshold at which detection becomes reliable.
  • Similar tests on non-reasoning models could reveal whether chain-of-thought use itself affects self-monitoring ability.

Load-bearing premise

The prompts and edit methods used in the experiments measure genuine detection ability rather than being influenced by how the questions are worded or how the changes are introduced.

What would settle it

A follow-up experiment that achieves high accuracy in identifying both the presence and the specific nature of CoT modifications across multiple models and edit types would falsify the claim of only modest detection.

Figures

Figures reproduced from arXiv: 2606.22085 by Chengyuan Xue, Miriam Wanner, Sathvik Napa, Utkarsh Singh, William Walden.

Figure 1
Figure 1. Figure 1: We prefill models with CoTs that have been [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Detection accuracy for the (alerted) completed (dark bars; §4.1) and partial (light bars; §4.2) conditions. The Ione and Erep are shared across conditions; Dhalf and Epar are unique to completed and partial, respectively. Error bars indicate 95% CIs; ∗ denotes significantly different from 50% at α = 0.05 on a two-sided z-test. 4 Detecting Changes Here, we investigate whether models can detect modifications… view at source ↗
Figure 4
Figure 4. Figure 4: Model accuracy in identifying where a CoT modification was made on MMLU. 5 Localizing Changes We have shown that models have only moderate ability to detect CoT modifications. We now in￾vestigate whether they are able to identify where modifications occur. To facilitate clear localiza￾tion of changes, we make modifications involving only a single step—either inserting (Ione), deleting (Done), or paraphrasi… view at source ↗
Figure 5
Figure 5. Figure 5: unalerted results in the completed condition [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Complete results for the experiment described in § [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Full change localization accuracy results (§ [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Effect of replacing the fixed irrelevant statement used for Ione in the main text (left column) with a dynamically LLM-generated statement that is relevant to the problem but that is a non-sequitur (Ione - LLM; right column). Detection accuracy plummets across the board. These results are from the partial condition (§4.2) [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Main task accuracy for all models on all datasets. We consistently obtain best results with [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
read the original abstract

There are many reasons one may want to edit a model's chain of thought (CoT) -- e.g., to prefill it with reasoning from a stronger model or to remove steps that may yield unsafe outputs. The success of these interventions plausibly depends on a model's inability to notice them, as the model may alter its behavior if it suspects tampering. In this work, we study whether recent reasoning models are able to detect such interventions on their CoTs under a variety of conditions: both during reasoning and after it, and when prefilled both with their own CoTs and with those of other models. Broadly, we find that (i) models exhibit only very modest detection accuracy; (ii) models struggle to identify *how* their CoT was modified; and (iii) models are about as good at detecting changes to their own CoTs as to those of other models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper investigates whether recent reasoning models can detect edits to their chains of thought (CoTs), both during and after generation and when the CoT is their own or from another model. It reports three main empirical findings: models achieve only modest accuracy in detecting such interventions, struggle to identify the specific type of modification, and show comparable detection performance on their own CoTs versus those of other models.

Significance. If the empirical results are robust, the work provides evidence that CoT interventions (e.g., for safety or capability enhancement) are unlikely to be noticed by the model itself, which bears on the feasibility of such techniques without triggering behavioral changes. The study contributes behavioral data on model introspection and self-monitoring in the context of reasoning traces.

major comments (2)
  1. [Methods] Methods (experimental design): The study does not report control conditions using unmodified CoTs to measure the base rate at which models report 'changes' in the absence of any intervention. Without these controls, the modest detection accuracy cannot be distinguished from generic prompt-induced affirmative responses, undermining interpretation of findings (i) and (iii).
  2. [Results] Results (detection accuracy): The abstract and reported findings provide no sample sizes, statistical tests, confidence intervals, or error bars, making it impossible to assess whether the 'modest accuracy' and 'no difference between own/other' claims are statistically supported or merely descriptive.
minor comments (1)
  1. [Abstract] The abstract refers to 'a variety of conditions' but does not enumerate them; a brief enumeration in the abstract or introduction would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of experimental design and reporting. We address each major point below and commit to revisions that strengthen the empirical claims.

read point-by-point responses
  1. Referee: [Methods] The study does not report control conditions using unmodified CoTs to measure the base rate at which models report 'changes' in the absence of any intervention. Without these controls, the modest detection accuracy cannot be distinguished from generic prompt-induced affirmative responses, undermining interpretation of findings (i) and (iii).

    Authors: We agree this is a valid concern. The current experiments focus on modified CoTs but do not explicitly include unmodified controls to quantify false-positive rates. In the revised manuscript we will add these control conditions (prompting models to detect changes on their original, unmodified CoTs) and report the resulting base rates alongside the main results. This will allow clearer interpretation of the reported detection accuracies. revision: yes

  2. Referee: [Results] Results (detection accuracy): The abstract and reported findings provide no sample sizes, statistical tests, confidence intervals, or error bars, making it impossible to assess whether the 'modest accuracy' and 'no difference between own/other' claims are statistically supported or merely descriptive.

    Authors: We acknowledge the omission. The full manuscript contains the underlying trial counts but does not present sample sizes, statistical tests, confidence intervals, or error bars in the abstract or main result summaries. We will revise to include these details (e.g., N per condition, appropriate tests for accuracy differences, and error bars on figures) so that the strength of the 'modest accuracy' and 'own vs. other' comparisons can be evaluated rigorously. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical behavioral study with no derivations or self-referential structure

full rationale

The paper is an empirical investigation of model behavior under CoT interventions. It reports experimental results on detection accuracy, identification of modification type, and own-vs-other CoT performance. No equations, fitted parameters, uniqueness theorems, ansatzes, or derivation chains appear in the provided text. All claims rest on direct measurement of prompted responses rather than any reduction of outputs to inputs by construction. This is a standard non-circular empirical design.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, axioms, or invented entities are present; the work is an empirical measurement study.

pith-pipeline@v0.9.1-grok · 5690 in / 1070 out tokens · 33768 ms · 2026-06-26T11:39:22.814368+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

300 extracted references · 170 canonical work pages

  1. [1]

    Proceedings of the 21st Workshop on Biomedical Language Processing. 2022

  2. [2]

    Explainable Assessment of Healthcare Articles with QA

    Boissonnet, Alodie and Saeidi, Marzieh and Plachouras, Vassilis and Vlachos, Andreas. Explainable Assessment of Healthcare Articles with QA. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.1

  3. [3]

    A sequence-to-sequence approach for document-level relation extraction

    Giorgi, John and Bader, Gary and Wang, Bo. A sequence-to-sequence approach for document-level relation extraction. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.2

  4. [4]

    Position-based Prompting for Health Outcome Generation

    Abaho, Micheal and Bollegala, Danushka and Williamson, Paula and Dodd, Susanna. Position-based Prompting for Health Outcome Generation. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.3

  5. [5]

    How You Say It Matters: Measuring the Impact of Verbal Disfluency Tags on Automated Dementia Detection

    Farzana, Shahla and Deshpande, Ashwin and Parde, Natalie. How You Say It Matters: Measuring the Impact of Verbal Disfluency Tags on Automated Dementia Detection. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.4

  6. [6]

    Zero-Shot Aspect-Based Scientific Document Summarization using Self-Supervised Pre-training

    Soleimani, Amir and Nikoulina, Vassilina and Favre, Benoit and Ait Mokhtar, Salah. Zero-Shot Aspect-Based Scientific Document Summarization using Self-Supervised Pre-training. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.5

  7. [7]

    Data Augmentation for Biomedical Factoid Question Answering

    Pappas, Dimitris and Malakasiotis, Prodromos and Androutsopoulos, Ion. Data Augmentation for Biomedical Factoid Question Answering. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.6

  8. [8]

    Slot Filling for Biomedical Information Extraction

    Papanikolaou, Yannis and Staib, Marlene and Grace, Justin Joshua and Bennett, Francine. Slot Filling for Biomedical Information Extraction. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.7

  9. [9]

    Automatic Biomedical Term Clustering by Learning Fine-grained Term Representations

    Zeng, Sihang and Yuan, Zheng and Yu, Sheng. Automatic Biomedical Term Clustering by Learning Fine-grained Term Representations. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.8

  10. [10]

    B io BART : Pretraining and Evaluation of A Biomedical Generative Language Model

    Yuan, Hongyi and Yuan, Zheng and Gan, Ruyi and Zhang, Jiaxing and Xie, Yutao and Yu, Sheng. B io BART : Pretraining and Evaluation of A Biomedical Generative Language Model. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.9

  11. [11]

    Incorporating Medical Knowledge to Transformer-based Language Models for Medical Dialogue Generation

    Naseem, Usman and Bandi, Ajay and Raza, Shaina and Rashid, Junaid and Chakravarthi, Bharathi Raja. Incorporating Medical Knowledge to Transformer-based Language Models for Medical Dialogue Generation. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.10

  12. [12]

    Memory-aligned Knowledge Graph for Clinically Accurate Radiology Image Report Generation

    Yan, Sixing. Memory-aligned Knowledge Graph for Clinically Accurate Radiology Image Report Generation. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.11

  13. [13]

    Simple Semantic-based Data Augmentation for Named Entity Recognition in Biomedical Texts

    Phan, Uyen and Nguyen, Nhung. Simple Semantic-based Data Augmentation for Named Entity Recognition in Biomedical Texts. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.12

  14. [14]

    Auxiliary Learning for Named Entity Recognition with Multiple Auxiliary Biomedical Training Data

    Watanabe, Taiki and Ichikawa, Tomoya and Tamura, Akihiro and Iwakura, Tomoya and Ma, Chunpeng and Kato, Tsuneo. Auxiliary Learning for Named Entity Recognition with Multiple Auxiliary Biomedical Training Data. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.13

  15. [15]

    SNP 2 V ec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study

    Cahyawijaya, Samuel and Yu, Tiezheng and Liu, Zihan and Zhou, Xiaopu and Mak, Tze Wing Tiffany and Ip, Yuk Yu Nancy and Fung, Pascale. SNP 2 V ec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.14

  16. [16]

    Biomedical NER using Novel Schema and Distant Supervision

    Khandelwal, Anshita and Kar, Alok and Chikka, Veera Raghavendra and Karlapalem, Kamalakar. Biomedical NER using Novel Schema and Distant Supervision. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.15

  17. [17]

    Improving Supervised Drug-Protein Relation Extraction with Distantly Supervised Models

    Iinuma, Naoki and Miwa, Makoto and Sasaki, Yutaka. Improving Supervised Drug-Protein Relation Extraction with Distantly Supervised Models. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.16

  18. [18]

    Named Entity Recognition for Cancer Immunology Research Using Distant Supervision

    Trieu, Hai-Long and Miwa, Makoto and Ananiadou, Sophia. Named Entity Recognition for Cancer Immunology Research Using Distant Supervision. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.17

  19. [19]

    Intra-Template Entity Compatibility based Slot-Filling for Clinical Trial Information Extraction

    Witte, Christian and Cimiano, Philipp. Intra-Template Entity Compatibility based Slot-Filling for Clinical Trial Information Extraction. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.18

  20. [20]

    Pretrained Biomedical Language Models for Clinical NLP in S panish

    Carrino, Casimiro Pio and Llop, Joan and P \`a mies, Marc and Guti \'e rrez-Fandi \ n o, Asier and Armengol-Estap \'e , Jordi and Silveira-Ocampo, Joaqu \'i n and Valencia, Alfonso and Gonzalez-Agirre, Aitor and Villegas, Marta. Pretrained Biomedical Language Models for Clinical NLP in S panish. Proceedings of the 21st Workshop on Biomedical Language Proc...

  21. [21]

    Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts

    Amin, Saadullah and Pokaratsiri Goldstein, Noon and Wixted, Morgan and Garcia-Rudolph, Alejandro and Mart \'i nez-Costa, Catalina and Neumann, Guenter. Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.20

  22. [22]

    VPAI \_ L ab at M ed V id QA 2022: A Two-Stage Cross-modal Fusion Method for Medical Instructional Video Classification

    Li, Bin and Weng, Yixuan and Xia, Fei and Sun, Bin and Li, Shutao. VPAI \_ L ab at M ed V id QA 2022: A Two-Stage Cross-modal Fusion Method for Medical Instructional Video Classification. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.21

  23. [23]

    G en C ompare S um: a hybrid unsupervised summarization method using salience

    Bishop, Jennifer and Xie, Qianqian and Ananiadou, Sophia. G en C ompare S um: a hybrid unsupervised summarization method using salience. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.22

  24. [24]

    B io C ite: A Deep Learning-based Citation Linkage Framework for Biomedical Research Articles

    Singha Roy, Sudipta and Mercer, Robert E. B io C ite: A Deep Learning-based Citation Linkage Framework for Biomedical Research Articles. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.23

  25. [25]

    Low Resource Causal Event Detection from Biomedical Literature

    Liang, Zhengzhong and Noriega-Atala, Enrique and Morrison, Clayton and Surdeanu, Mihai. Low Resource Causal Event Detection from Biomedical Literature. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.24

  26. [26]

    Overview of the M ed V id QA 2022 Shared Task on Medical Video Question-Answering

    Gupta, Deepak and Demner-Fushman, Dina. Overview of the M ed V id QA 2022 Shared Task on Medical Video Question-Answering. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.25

  27. [27]

    Inter-annotator agreement is not the ceiling of machine learning performance: Evidence from a comprehensive set of simulations

    Richie, Russell and Grover, Sachin and Tsui, Fuchiang (Rich). Inter-annotator agreement is not the ceiling of machine learning performance: Evidence from a comprehensive set of simulations. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.26

  28. [28]

    and Zuo, Xu and Hu, Yan and Kuttichi Keloth, Vipina and Li, Jianfu and Zheng, W

    Das, Avisha and Selek, Salih and Warner, Alia R. and Zuo, Xu and Hu, Yan and Kuttichi Keloth, Vipina and Li, Jianfu and Zheng, W. Jim and Xu, Hua. Conversational Bots for Psychotherapy: A Study of Generative Transformer Models Using Domain-specific Dialogues. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bi...

  29. [29]

    BEEDS : Large-Scale Biomedical Event Extraction using Distant Supervision and Question Answering

    Wang, Xing David and Leser, Ulf and Weber, Leon. BEEDS : Large-Scale Biomedical Event Extraction using Distant Supervision and Question Answering. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.28

  30. [30]

    Data Augmentation for Rare Symptoms in Vaccine Side-Effect Detection

    Kim, Bosung and Nakashole, Ndapa. Data Augmentation for Rare Symptoms in Vaccine Side-Effect Detection. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.29

  31. [31]

    Improving R omanian B io NER Using a Biologically Inspired System

    Mitrofan, Maria and Pais, Vasile. Improving R omanian B io NER Using a Biologically Inspired System. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.30

  32. [32]

    B angla B io M ed: A Biomedical Named-Entity Annotated Corpus for B angla ( B engali)

    Sazzed, Salim. B angla B io M ed: A Biomedical Named-Entity Annotated Corpus for B angla ( B engali). Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.31

  33. [33]

    ICDB ig B ird: A Contextual Embedding Model for ICD Code Classification

    Michalopoulos, George and Malyska, Michal and Sahar, Nicola and Wong, Alexander and Chen, Helen. ICDB ig B ird: A Contextual Embedding Model for ICD Code Classification. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.32

  34. [34]

    Doctor XA v I er: Explainable Diagnosis on Physician-Patient Dialogues and XAI Evaluation

    Ngai, Hillary and Rudzicz, Frank. Doctor XA v I er: Explainable Diagnosis on Physician-Patient Dialogues and XAI Evaluation. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.33

  35. [35]

    DISTANT - CTO : A Zero Cost, Distantly Supervised Approach to Improve Low-Resource Entity Extraction Using Clinical Trials Literature

    Dhrangadhariya, Anjani and M. DISTANT - CTO : A Zero Cost, Distantly Supervised Approach to Improve Low-Resource Entity Extraction Using Clinical Trials Literature. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.34

  36. [36]

    and Peng, Yifan

    Tang, Liyan and Kooragayalu, Shravan and Wang, Yanshan and Ding, Ying and Durrett, Greg and Rousseau, Justin F. and Peng, Yifan. E cho G en: Generating Conclusions from Echocardiogram Notes. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.35

  37. [37]

    Quantifying Clinical Outcome Measures in Patients with Epilepsy Using the Electronic Health Record

    Xie, Kevin and Litt, Brian and Roth, Dan and Ellis, Colin A. Quantifying Clinical Outcome Measures in Patients with Epilepsy Using the Electronic Health Record. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.36

  38. [38]

    Comparing Encoder-Only and Encoder-Decoder Transformers for Relation Extraction from Biomedical Texts: An Empirical Study on Ten Benchmark Datasets

    Sarrouti, Mourad and Tao, Carson and Mamy Randriamihaja, Yoann. Comparing Encoder-Only and Encoder-Decoder Transformers for Relation Extraction from Biomedical Texts: An Empirical Study on Ten Benchmark Datasets. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.37

  39. [39]

    Utility Preservation of Clinical Text After De-Identification

    Vakili, Thomas and Dalianis, Hercules. Utility Preservation of Clinical Text After De-Identification. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.38

  40. [40]

    Horses to Zebras: Ontology-Guided Data Augmentation and Synthesis for ICD -9 Coding

    Falis, Mat \'u s and Dong, Hang and Birch, Alexandra and Alex, Beatrice. Horses to Zebras: Ontology-Guided Data Augmentation and Synthesis for ICD -9 Coding. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.39

  41. [41]

    Towards Automatic Curation of Antibiotic Resistance Genes via Statement Extraction from Scientific Papers: A Benchmark Dataset and Models

    Chandak, Sidhant and Zhang, Liqing and Brown, Connor and Huang, Lifu. Towards Automatic Curation of Antibiotic Resistance Genes via Statement Extraction from Scientific Papers: A Benchmark Dataset and Models. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.40

  42. [42]

    Model Distillation for Faithful Explanations of Medical Code Predictions

    Wood-Doughty, Zach and Cachola, Isabel and Dredze, Mark. Model Distillation for Faithful Explanations of Medical Code Predictions. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.41

  43. [43]

    and Szolovits, Peter

    Liang, Jennifer J and Lehman, Eric and Iyengar, Ananya and Mahajan, Diwakar and Raghavan, Preethi and Chang, Cindy Y. and Szolovits, Peter. Towards Generalizable Methods for Automating Risk Score Calculation. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.42

  44. [44]

    D o SSIER at M ed V id QA 2022: Text-based Approaches to Medical Video Answer Localization Problem

    Kusa, Wojciech and Peikos, Georgios and Espitia, \'O scar and Hanbury, Allan and Pasi, Gabriella. D o SSIER at M ed V id QA 2022: Text-based Approaches to Medical Video Answer Localization Problem. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.43

  45. [45]

    Proceedings of BigScience Episode \# 5 -- Workshop on Challenges & Perspectives in Creating Large Language Models. 2022

  46. [46]

    Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora

    Jin, Xisen and Zhang, Dejiao and Zhu, Henghui and Xiao, Wei and Li, Shang-Wen and Wei, Xiaokai and Arnold, Andrew and Ren, Xiang. Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora. Proceedings of BigScience Episode \# 5 -- Workshop on Challenges & Perspectives in Creating Large Language Models. 2022. doi:10.18653/v1/2022.bigsc...

  47. [47]

    Using ASR -Generated Text for Spoken Language Modeling

    Herv \'e , Nicolas and Pelloin, Valentin and Favre, Benoit and Dary, Franck and Laurent, Antoine and Meignier, Sylvain and Besacier, Laurent. Using ASR -Generated Text for Spoken Language Modeling. Proceedings of BigScience Episode \# 5 -- Workshop on Challenges & Perspectives in Creating Large Language Models. 2022. doi:10.18653/v1/2022.bigscience-1.2

  48. [48]

    You reap what you sow: On the Challenges of Bias Evaluation Under Multilingual Settings

    Talat, Zeerak and N \'e v \'e ol, Aur \'e lie and Biderman, Stella and Clinciu, Miruna and Dey, Manan and Longpre, Shayne and Luccioni, Sasha and Masoud, Maraim and Mitchell, Margaret and Radev, Dragomir and Sharma, Shanya and Subramonian, Arjun and Tae, Jaesung and Tan, Samson and Tunuguntla, Deepak and Van Der Wal, Oskar. You reap what you sow: On the C...

  49. [49]

    Diverse Lottery Tickets Boost Ensemble from a Single Pretrained Model

    Kobayashi, Sosuke and Kiyono, Shun and Suzuki, Jun and Inui, Kentaro. Diverse Lottery Tickets Boost Ensemble from a Single Pretrained Model. Proceedings of BigScience Episode \# 5 -- Workshop on Challenges & Perspectives in Creating Large Language Models. 2022. doi:10.18653/v1/2022.bigscience-1.4

  50. [50]

    UNIREX : A Unified Learning Framework for Language Model Rationale Extraction

    Chan, Aaron and Sanjabi, Maziar and Mathias, Lambert and Tan, Liang and Nie, Shaoliang and Peng, Xiaochang and Ren, Xiang and Firooz, Hamed. UNIREX : A Unified Learning Framework for Language Model Rationale Extraction. Proceedings of BigScience Episode \# 5 -- Workshop on Challenges & Perspectives in Creating Large Language Models. 2022. doi:10.18653/v1/...

  51. [51]

    Pipelines for Social Bias Testing of Large Language Models

    Nozza, Debora and Bianchi, Federico and Hovy, Dirk. Pipelines for Social Bias Testing of Large Language Models. Proceedings of BigScience Episode \# 5 -- Workshop on Challenges & Perspectives in Creating Large Language Models. 2022. doi:10.18653/v1/2022.bigscience-1.6

  52. [52]

    Entities, Dates, and Languages: Zero-Shot on Historical Texts with T0

    De Toni, Francesco and Akiki, Christopher and De La Rosa, Javier and Fourrier, Cl \'e mentine and Manjavacas, Enrique and Schweter, Stefan and Van Strien, Daniel. Entities, Dates, and Languages: Zero-Shot on Historical Texts with T0. Proceedings of BigScience Episode \# 5 -- Workshop on Challenges & Perspectives in Creating Large Language Models. 2022. do...

  53. [53]

    A Holistic Assessment of the Carbon Footprint of Noor, a Very Large A rabic Language Model

    Lakim, Imad and Almazrouei, Ebtesam and Abualhaol, Ibrahim and Debbah, Merouane and Launay, Julien. A Holistic Assessment of the Carbon Footprint of Noor, a Very Large A rabic Language Model. Proceedings of BigScience Episode \# 5 -- Workshop on Challenges & Perspectives in Creating Large Language Models. 2022. doi:10.18653/v1/2022.bigscience-1.8

  54. [54]

    GPT - N eo X -20 B : An Open-Source Autoregressive Language Model

    Black, Sidney and Biderman, Stella and Hallahan, Eric and Anthony, Quentin and Gao, Leo and Golding, Laurence and He, Horace and Leahy, Connor and McDonell, Kyle and Phang, Jason and Pieler, Michael and Prashanth, Usvsn Sai and Purohit, Shivanshu and Reynolds, Laria and Tow, Jonathan and Wang, Ben and Weinbach, Samuel. GPT - N eo X -20 B : An Open-Source ...

  55. [55]

    Dataset Debt in Biomedical Language Modeling

    Fries, Jason and Seelam, Natasha and Altay, Gabriel and Weber, Leon and Kang, Myungsun and Datta, Debajyoti and Su, Ruisi and Garda, Samuele and Wang, Bo and Ott, Simon and Samwald, Matthias and Kusa, Wojciech. Dataset Debt in Biomedical Language Modeling. Proceedings of BigScience Episode \# 5 -- Workshop on Challenges & Perspectives in Creating Large La...

  56. [56]

    Emergent Structures and Training Dynamics in Large Language Models

    Teehan, Ryan and Clinciu, Miruna and Serikov, Oleg and Szczechla, Eliza and Seelam, Natasha and Mirkin, Shachar and Gokaslan, Aaron. Emergent Structures and Training Dynamics in Large Language Models. Proceedings of BigScience Episode \# 5 -- Workshop on Challenges & Perspectives in Creating Large Language Models. 2022. doi:10.18653/v1/2022.bigscience-1.11

  57. [57]

    Foundation Models of Scientific Knowledge for Chemistry: Opportunities, Challenges and Lessons Learned

    Horawalavithana, Sameera and Ayton, Ellyn and Sharma, Shivam and Howland, Scott and Subramanian, Megha and Vasquez, Scott and Cosbey, Robin and Glenski, Maria and Volkova, Svitlana. Foundation Models of Scientific Knowledge for Chemistry: Opportunities, Challenges and Lessons Learned. Proceedings of BigScience Episode \# 5 -- Workshop on Challenges & Pers...

  58. [58]

    Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022

  59. [59]

    Using Item Response Theory to Measure Gender and Racial Bias of a BERT -based Automated E nglish Speech Assessment System

    Kwako, Alexander and Wan, Yixin and Zhao, Jieyu and Chang, Kai-Wei and Cai, Li and Hansen, Mark. Using Item Response Theory to Measure Gender and Racial Bias of a BERT -based Automated E nglish Speech Assessment System. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022. doi:10.18653/v1/2022.bea-1.1

  60. [60]

    Automatic scoring of short answers using justification cues estimated by BERT

    Takano, Shunya and Ichikawa, Osamu. Automatic scoring of short answers using justification cues estimated by BERT. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022. doi:10.18653/v1/2022.bea-1.2

  61. [61]

    Mitigating Learnerese Effects for CEFR Classification

    Jalota, Rricha and Bourgonje, Peter and Van Sas, Jan and Huang, Huiyan. Mitigating Learnerese Effects for CEFR Classification. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022. doi:10.18653/v1/2022.bea-1.3

  62. [62]

    Automatically Detecting Reduced-formed E nglish Pronunciations by Using Deep Learning

    Chen, Lei and Jiang, Chenglin and Gu, Yiwei and Liu, Yang and Yuan, Jiahong. Automatically Detecting Reduced-formed E nglish Pronunciations by Using Deep Learning. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022. doi:10.18653/v1/2022.bea-1.4

  63. [63]

    A Baseline Readability Model for C ebuano

    Imperial, Joseph Marvin and Reyes, Lloyd Lois Antonie and Ibanez, Michael Antonio and Sapinit, Ranz and Hussien, Mohammed. A Baseline Readability Model for C ebuano. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022. doi:10.18653/v1/2022.bea-1.5

  64. [64]

    Generation of Synthetic Error Data of Verb Order Errors for S wedish

    Casademont Moner, Judit and Volodina, Elena. Generation of Synthetic Error Data of Verb Order Errors for S wedish. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022. doi:10.18653/v1/2022.bea-1.6

  65. [65]

    A Dependency Treebank of Spoken Second Language E nglish

    Kyle, Kristopher and Eguchi, Masaki and Miller, Aaron and Sither, Theodore. A Dependency Treebank of Spoken Second Language E nglish. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022. doi:10.18653/v1/2022.bea-1.7

  66. [66]

    Starting from ``Zero'': An Incremental Zero-shot Learning Approach for Assessing Peer Feedback Comments

    Jia, Qinjin and Cao, Yupeng and Gehringer, Edward. Starting from ``Zero'': An Incremental Zero-shot Learning Approach for Assessing Peer Feedback Comments. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022. doi:10.18653/v1/2022.bea-1.8

  67. [67]

    On Assessing and Developing Spoken `Grammatical Error Correction' Systems

    Lu, Yiting and Bann \`o , Stefano and Gales, Mark. On Assessing and Developing Spoken `Grammatical Error Correction' Systems. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022. doi:10.18653/v1/2022.bea-1.9

  68. [68]

    Automatic True/False Question Generation for Educational Purpose

    Zou, Bowei and Li, Pengfei and Pan, Liangming and Aw, Ai Ti. Automatic True/False Question Generation for Educational Purpose. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022. doi:10.18653/v1/2022.bea-1.10

  69. [69]

    and Sumner, Tamara

    Suresh, Abhijit and Jacobs, Jennifer and Perkoff, Margaret and Martin, James H. and Sumner, Tamara. Fine-tuning Transformers with Additional Context to Classify Discursive Moves in Mathematics Classrooms. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022. doi:10.18653/v1/2022.bea-1.11

  70. [70]

    Cross-corpora experiments of automatic proficiency assessment and error detection for spoken E nglish

    Bann \`o , Stefano and Matassoni, Marco. Cross-corpora experiments of automatic proficiency assessment and error detection for spoken E nglish. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022. doi:10.18653/v1/2022.bea-1.12

  71. [71]

    Activity focused Speech Recognition of Preschool Children in Early Childhood Classrooms

    Dutta, Satwik and Irvin, Dwight and Buzhardt, Jay and Hansen, John H.L. Activity focused Speech Recognition of Preschool Children in Early Childhood Classrooms. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022. doi:10.18653/v1/2022.bea-1.13

  72. [72]

    Structural information in mathematical formulas for exercise difficulty prediction: a comparison of NLP representations

    Loginova, Ekaterina and Benoit, Dries. Structural information in mathematical formulas for exercise difficulty prediction: a comparison of NLP representations. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022. doi:10.18653/v1/2022.bea-1.14

  73. [73]

    The Specificity and Helpfulness of Peer-to-Peer Feedback in Higher Education

    Rietsche, Roman and Caines, Andrew and Schramm, Cornelius and Pf. The Specificity and Helpfulness of Peer-to-Peer Feedback in Higher Education. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022. doi:10.18653/v1/2022.bea-1.15

  74. [74]

    Similarity-Based Content Scoring - How to Make S - BERT Keep Up With BERT

    Bexte, Marie and Horbach, Andrea and Zesch, Torsten. Similarity-Based Content Scoring - How to Make S - BERT Keep Up With BERT. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022. doi:10.18653/v1/2022.bea-1.16

  75. [75]

    Don ' t Drop the Topic - The Role of the Prompt in Argument Identification in Student Writing

    Ding, Yuning and Bexte, Marie and Horbach, Andrea. Don ' t Drop the Topic - The Role of the Prompt in Argument Identification in Student Writing. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022. doi:10.18653/v1/2022.bea-1.17

  76. [76]

    ALEN App: Argumentative Writing Support To Foster E nglish Language Learning

    Wambsganss, Thiemo and Caines, Andrew and Buttery, Paula. ALEN App: Argumentative Writing Support To Foster E nglish Language Learning. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022. doi:10.18653/v1/2022.bea-1.18

  77. [77]

    Assessing sentence readability for G erman language learners with broad linguistic modeling or readability formulas: When do linguistic insights make a difference?

    Weiss, Zarah and Meurers, Detmar. Assessing sentence readability for G erman language learners with broad linguistic modeling or readability formulas: When do linguistic insights make a difference?. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022. doi:10.18653/v1/2022.bea-1.19

  78. [78]

    Parametrizable exercise generation from authentic texts: Effectively targeting the language means on the curriculum

    Heck, Tanja and Meurers, Detmar. Parametrizable exercise generation from authentic texts: Effectively targeting the language means on the curriculum. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022. doi:10.18653/v1/2022.bea-1.20

  79. [79]

    Selecting Context Clozes for Lightweight Reading Compliance

    Keim, Greg and Littman, Michael. Selecting Context Clozes for Lightweight Reading Compliance. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022. doi:10.18653/v1/2022.bea-1.21

  80. [80]

    `Meet me at the ribary' -- Acceptability of spelling variants in free-text answers to listening comprehension prompts

    Laarmann-Quante, Ronja and Schwarz, Leska and Horbach, Andrea and Zesch, Torsten. `Meet me at the ribary' -- Acceptability of spelling variants in free-text answers to listening comprehension prompts. Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). 2022. doi:10.18653/v1/2022.bea-1.22

Showing first 80 references.