COFT: Counterfactual-Conformal Decoding for Fair Chain-of-Thought Reasoning in Large Language Models

Arya Fayyazi; Massoud Pedram; Mehdi Kamal

arxiv: 2605.30641 · v1 · pith:63L2WCSDnew · submitted 2026-05-28 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-29 07:15 UTC · model grok-4.3

keywords bias mitigation · chain-of-thought · conformal prediction · counterfactual prompting · fairness in LLMs · decoding methods · large language models

The pith

COFT uses counterfactual masking and dual-branch split-conformal calibration at decode time to cut bias metrics in LLM chain-of-thought by 30-55% while preserving accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents COFT as a training-free decoding method that intervenes during chain-of-thought generation to reduce attribute-driven biases. It forms a masked counterfactual prompt by neutral replacement of sensitive spans, fuses the factual and counterfactual logit distributions, and applies split-conformal calibration on the dual branches to certify per-step token candidate sets. This produces 30-55% reductions in standard bias metrics across six models and multiple benchmarks, with task accuracy and language quality unchanged within noise margins and only one extra cached forward pass of overhead. A sympathetic reader would care because the approach supplies distribution-free marginal validity guarantees under exchangeability for any frozen causal language model without requiring retraining or weight access.
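The neutral-replacement step can be illustrated with a minimal sketch. It assumes a small hand-written lexicon (`NEUTRAL_MAP`) and whole-word regular-expression matching; both are hypothetical stand-ins, since this page does not describe how the paper detects sensitive spans or which neutral tokens it substitutes.

```python
import re

# Hypothetical lexicon: sensitive term -> neutral replacement.
# The paper's actual span detector and neutral tokens are not specified here.
NEUTRAL_MAP = {
    "he": "they", "she": "they",
    "his": "their", "her": "their",
    "man": "person", "woman": "person",
}

# Whole-word, case-insensitive match over the lexicon keys.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, NEUTRAL_MAP)) + r")\b",
    re.IGNORECASE,
)

def mask_sensitive_spans(prompt: str) -> str:
    """Return a counterfactual prompt with sensitive words neutralized."""
    def _replace(match: re.Match) -> str:
        word = match.group(0)
        neutral = NEUTRAL_MAP[word.lower()]
        # Preserve sentence-initial capitalization.
        return neutral.capitalize() if word[0].isupper() else neutral
    return _PATTERN.sub(_replace, prompt)

factual_prompt = "She said the man lost his keys."
masked_prompt = mask_sensitive_spans(factual_prompt)
print(masked_prompt)
```

A lexicon lookup like this misses implicit or multi-word attribute cues, so it should be read only as a picture of what "neutral replacement of sensitive spans" means, not as the paper's masking procedure.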

Core claim

COFT operates in three stages on any frozen causal language model: a masked counterfactual prompt is created by replacing sensitive spans with neutral tokens, factual and masked logit distributions are compared through lightweight fusion to attenuate attribute-driven biases, and dual-branch split-conformal calibration certifies per-step candidate token sets at a user-chosen risk level. When evaluated across six models and multiple bias benchmarks, the procedure reduces standard bias metrics by 30-55% (median 38%) while preserving task utility and language quality, with reasoning accuracies remaining unchanged within run-to-run noise margins and computational overhead equivalent to one additional cached forward pass (≤11%).
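As an illustration of the second stage, the sketch below fuses the two branches with a convex combination in logit space, weighted by a parameter `lam`. This functional form is an assumption made for illustration: the page only indicates that a fusion weight λ exists (the Figure 3 caption reports λ≈0.6 as a selected value) and does not state the paper's fusion operator.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_logits(factual, counterfactual, lam=0.6):
    """Hypothetical fusion rule: (1 - lam) * factual + lam * counterfactual.

    lam = 0 recovers vanilla decoding; lam = 1 decodes from the masked
    branch only. The paper's real operator may differ.
    """
    if len(factual) != len(counterfactual):
        raise ValueError("both branches must score the same vocabulary")
    return [(1 - lam) * f + lam * c for f, c in zip(factual, counterfactual)]

# Toy 3-token vocabulary: token 0 is favoured only when the sensitive
# attribute is visible (factual branch).
factual = [2.0, 1.0, 0.0]
counterfactual = [1.0, 1.0, 0.0]
fused = fuse_logits(factual, counterfactual, lam=0.6)
print(softmax(factual)[0], softmax(fused)[0])
```

In this toy case the probability of the attribute-driven token drops after fusion, which is the attenuation effect the core claim describes.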

What carries the argument

Dual-branch split-conformal calibration on fused factual-counterfactual logit sequences, which produces per-step certified candidate token sets with distribution-free marginal validity guarantees under exchangeability.
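A minimal sketch of how such a certified set might be built at one decoding step follows. The quantile rule is the standard split-conformal one; the nonconformity score (`1 - p`) and the way the two branches are combined (taking the worse of the two scores) are assumptions for illustration, since the page does not give the paper's score function.

```python
import math

def conformal_quantile(cal_scores, alpha):
    """Standard split-conformal threshold: the ceil((n+1)(1-alpha))-th
    smallest calibration score, or +inf if that rank exceeds n."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return sorted(cal_scores)[k - 1]

def dual_branch_score(p_fused, p_cf, token):
    """Hypothetical dual-branch score: a token must be plausible under
    BOTH the fused and the counterfactual distribution to score low."""
    return max(1.0 - p_fused[token], 1.0 - p_cf[token])

def candidate_set(p_fused, p_cf, q_hat):
    """All token ids whose dual-branch score is within the threshold."""
    return [
        v for v in range(len(p_fused))
        if dual_branch_score(p_fused, p_cf, v) <= q_hat
    ]

# Calibration scores would come from held-out (prompt, true token) pairs.
q_hat = conformal_quantile([0.3, 0.1, 0.2], alpha=0.5)
print(candidate_set([0.7, 0.2, 0.1], [0.6, 0.3, 0.1], q_hat=0.5))
```

The threshold is computed once on calibration data and then reused at every step, which is what makes the resulting sets cheap to produce and easy to log.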

If this is right

Standard bias metrics fall by 30-55% (median 38%) on multiple benchmarks.
Reasoning accuracy and language quality remain unchanged within run-to-run noise.
The method applies to any frozen causal language model without retraining or auxiliary classifiers.
Overhead equals one additional cached forward pass (at most 11%).
Per-step candidate sets supply auditable coverage guarantees at a user-chosen risk level.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The certified token sets could be logged to audit exactly which tokens are excluded on fairness grounds during generation.
If exchangeability holds across new prompt distributions, the same calibration could be reused for non-CoT generation tasks.
Dynamic adjustment of the risk level mid-generation becomes possible because calibration is performed per step.
The approach could be stacked with temperature or top-p sampling without invalidating the coverage statements.
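The stacking idea in the last item can be pictured as sampling only from tokens inside the certified set. The sketch below is hypothetical: the function name, its parameters, and the order of operations (temperature first, then top-p, both restricted to the allowed set) are choices made here for illustration. Whether the coverage statement survives this composition is the conjecture above, not something the code establishes.

```python
import random

def sample_from_certified_set(probs, allowed, temperature=1.0, top_p=1.0,
                              rng=random):
    """Sample one token id from `allowed`, with temperature and top-p
    applied inside the certified set (illustrative composition only)."""
    if not allowed:
        raise ValueError("certified set is empty")
    # p ** (1/T) is proportional to softmax(logits / T), so raising the
    # probabilities to 1/T applies temperature without needing logits.
    weights = {v: probs[v] ** (1.0 / temperature) for v in allowed}
    total = sum(weights.values())
    if total <= 0.0:
        raise ValueError("allowed tokens carry no probability mass")
    # Nucleus filter: keep the smallest high-probability prefix whose
    # renormalized mass reaches top_p.
    ranked = sorted(weights, key=weights.get, reverse=True)
    kept, mass = [], 0.0
    for v in ranked:
        kept.append(v)
        mass += weights[v] / total
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[weights[v] for v in kept], k=1)[0]

token = sample_from_certified_set([0.5, 0.3, 0.2], allowed=[0, 2])
print(token)
```

Because excluded tokens never enter `kept`, the sampler cannot emit a token outside the certified set regardless of the temperature or top-p setting.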

Load-bearing premise

The factual and masked counterfactual logit sequences satisfy the exchangeability condition required for the split-conformal procedure to deliver valid per-step coverage guarantees.

What would settle it

Observing that the true next token falls outside the certified candidate set more often than the target risk level on a held-out exchangeable dataset would falsify the coverage guarantee.
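This check is straightforward to sketch. The code below computes the empirical miss rate on held-out steps and compares it with the target α plus a one-sided normal-approximation slack. The slack rule and the default `z = 1.645` are assumptions made here; steps inside one generated sequence are not independent, so the comparison is a rough screen rather than a formal test.

```python
import math

def empirical_miscoverage(true_tokens, candidate_sets):
    """Fraction of steps whose true next token is outside its set."""
    if not true_tokens or len(true_tokens) != len(candidate_sets):
        raise ValueError("need one candidate set per true token")
    misses = sum(
        1 for tok, cset in zip(true_tokens, candidate_sets) if tok not in cset
    )
    return misses / len(true_tokens)

def exceeds_target(miss_rate, alpha, n, z=1.645):
    """True if the miss rate is above alpha by more than sampling slack
    (binomial normal approximation; treats steps as independent)."""
    slack = z * math.sqrt(alpha * (1 - alpha) / n)
    return miss_rate > alpha + slack

# Toy held-out data: 4 steps, the third one is a miss.
rate = empirical_miscoverage([1, 2, 3, 4], [{1}, {2}, {0}, {4}])
print(rate, exceeds_target(rate, alpha=0.1, n=4))
```

With only four steps the slack is wide, so the toy miss rate does not count as a violation; a meaningful check needs a held-out set large enough for the slack to be small relative to α.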

Figures

Figures reproduced from arXiv: 2605.30641 by Arya Fayyazi, Massoud Pedram, Mehdi Kamal.

**Figure 1.** Overview of the COFT framework. Sensitive input spans are masked in the auxiliary branch, while downstream predictions are regularized through fused logits and dual-branch certification.

**Figure 2.** Overview of the workflow. The masked prompt is initialized once; at each decoding step both branches use the same generated prefix and receive the same selected token.

**Figure 3.** Ablation: λ. Validation Pareto; we pick the smallest λ within 2% of the knee (here λ≈0.6). [Plot: α sweep on validation, BiasAvg ↓ against UtilityAvg ↑, points at α = 0.05, 0.10, 0.20, with the chosen knee marked.]

**Figure 5.** Long-form summarization quality vs. length. MAUVE on CNN/DailyMail as we vary max generation length T. MAUVE decreases slightly for both vanilla and COFT as generations get longer, and COFT stays within 0.01 of vanilla for all T (gap ≤ 0.008 at T=1024), indicating no systematic degradation in long-form quality.

**Figure 6.** Runtime scaling of COFT vs. vanilla decoding. COFT introduces a predictable ≈10% overhead from the additional masked pass, with tight confidence bands indicating stable performance across windows.

**Figure 7.** Model-wise ablations of λ and α. Top: bias–utility vs. λ (stars: selected λ). Bottom: miscoverage and normalized candidate-set size vs. α (stars: selected α).

**Figure 8.** Effect of masking vs. COFT (Mistral-7B-Instruct). Masking alone reduces explicit bias but COFT further attenuates attribute-driven preferences and certifies joint plausibility (lower is better).

**Figure 9.** Sequence-level (max-st) miscoverage vs. target. The empirical sequence-level miscoverage is slightly conservative relative to the nominal target α, reflecting the max aggregation across steps; a per-step union bound yields a similar upper envelope.
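The union-bound remark in the Figure 9 caption can be written out. Assume each per-step set $C_t$ has marginal miscoverage at most $\alpha$ (which itself rests on exchangeability), and write $v_t^\star$ for the true next token and $T$ for the number of decoding steps; the symbols $C_t$ and $T$ are notation chosen here for illustration.

```latex
\Pr\left[\exists\, t \le T : v_t^\star \notin C_t\right]
  \;\le\; \sum_{t=1}^{T} \Pr\left[v_t^\star \notin C_t\right]
  \;\le\; T\alpha .
```

Holding the sequence-level rate at $\alpha$ by this route would therefore mean calibrating each step at $\alpha / T$. The bound makes no use of dependence between steps, which is one reason it tends to be loose for long generations.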

Original abstract

Large language models (LLMs) can reveal and amplify societal biases during chain-of-thought (CoT) generation. We present COFT (Chain of Fair Thought), a training-free decoding method that applies token-level fairness control at decode time, with distribution-free marginal validity guarantees (under exchangeability) for any frozen causal language model. COFT operates in three stages. First, it creates a masked counterfactual prompt by replacing sensitive spans with neutral tokens. Second, it compares the factual and masked logit distributions through lightweight logit fusion to attenuate attribute-driven biases. Third, it uses dual-branch split-conformal calibration to certify per-step candidate token sets at a user-chosen risk level. We evaluate COFT across six models and multiple bias benchmarks. Our method reduces standard bias metrics by 30-55% (median 38%) while preserving task utility and language quality. Reasoning accuracies remain unchanged within run-to-run noise margins. The computational overhead is modest, equivalent to one additional cached forward pass (<=11%). COFT offers a clear, auditable path to safer CoT generation with significant bias reduction, negligible utility loss, and no requirement for retraining, auxiliary classifiers, or weight access.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

COFT gives a workable inference-time bias fix for CoT via masking plus conformal calibration, but the coverage claim depends on an untested exchangeability assumption that masking likely breaks.

COFT is a training-free decoding method that masks sensitive spans to build a counterfactual prompt, fuses logits from the factual and masked branches to dampen biased tokens, and runs dual-branch split-conformal calibration to pick token sets at a chosen risk level. The combination of those three pieces appears new.

The paper does the practical side cleanly. It reports 30-55% drops in standard bias metrics across six models and several benchmarks, with task accuracy and output quality staying inside run-to-run noise and only one extra cached forward pass of overhead. Those numbers are useful for anyone who needs to add fairness control without retraining or weight access.

The soft spot is the validity guarantee. The dual-branch conformal procedure requires the factual and masked logit sequences to be exchangeable so that the conformity scores deliver the stated marginal coverage. Masking changes the context and therefore the next-token distribution, which creates a systematic shift that is unlikely to preserve exchangeability over multi-step CoT trajectories. No diagnostic (rank uniformity or otherwise) or derivation is given to support the assumption, so the distribution-free claim does not follow from the standard theorem. The empirical bias reduction stands, but the theoretical part is weaker than presented.

This is for practitioners who want a plug-in fairness layer on frozen models. It deserves peer review because the experiments are concrete and the method is implementable, even though referees will need to examine the exchangeability step and the exact fusion rule.

Referee Report

1 major / 2 minor

Summary. The paper introduces COFT, a training-free decoding method for mitigating societal biases in chain-of-thought (CoT) reasoning of frozen causal LLMs. It proceeds in three stages: (1) constructing a masked counterfactual prompt by replacing sensitive spans with neutral tokens, (2) performing lightweight logit fusion between factual and counterfactual branches to attenuate attribute-driven biases, and (3) applying dual-branch split-conformal calibration to produce per-step candidate token sets with distribution-free marginal validity guarantees (under an exchangeability assumption) at a user-chosen risk level α. Experiments across six models and multiple bias benchmarks report 30-55% (median 38%) reductions in standard bias metrics while preserving task accuracy and language quality, with computational overhead equivalent to one additional cached forward pass (≤11%).

Significance. If the central validity claim holds, COFT would constitute a practical advance: a post-hoc, auditable fairness intervention applicable to any frozen LLM without retraining, auxiliary classifiers, or weight access. The combination of counterfactual masking with split-conformal prediction for per-step coverage, together with the reported empirical bias reductions and negligible utility loss, would be a notable contribution to controllable decoding. The training-free nature and modest overhead are explicit strengths.

major comments (1)

[dual-branch split-conformal calibration] The dual-branch split-conformal procedure (abstract and §3): the distribution-free marginal validity guarantees are stated to rest on the exchangeability of the factual and masked-counterfactual logit sequences. Replacing sensitive spans with neutral tokens induces a systematic distributional shift in context and next-token statistics; this shift is especially pronounced across multi-step CoT trajectories. No derivation, justification, or diagnostic (e.g., uniformity of conformity-score ranks on held-out data) is supplied to support that the two branches remain exchangeable. Without this, the coverage guarantees do not follow from the standard split-conformal theorem and the central theoretical claim is unsupported.
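The rank-uniformity diagnostic suggested in this comment can be sketched as follows. Under exchangeability, the conformal p-value of a held-out score (its rank among the calibration scores, normalized by n + 1) is approximately uniform on (0, 1]. The code computes those p-values and a Kolmogorov-Smirnov distance to the uniform distribution. Function names are hypothetical, and no decision threshold for the distance is taken from the paper.

```python
import bisect

def conformal_pvalues(cal_scores, test_scores):
    """p = (1 + #{calibration scores >= s}) / (n + 1) for each test score."""
    cal = sorted(cal_scores)
    n = len(cal)
    return [
        # bisect_left counts calibration scores strictly below s.
        (1 + n - bisect.bisect_left(cal, s)) / (n + 1)
        for s in test_scores
    ]

def ks_uniform_statistic(pvals):
    """Largest gap between the empirical CDF of pvals and Uniform(0, 1)."""
    xs = sorted(pvals)
    m = len(xs)
    return max(
        max((i + 1) / m - x, x - i / m) for i, x in enumerate(xs)
    )

pvals = conformal_pvalues([0.1, 0.2, 0.3], [0.25])
print(pvals, ks_uniform_statistic(pvals))
```

A large distance on held-out data, or p-values piling up near 0, would indicate that test-time scores are shifted relative to calibration scores, which is the failure mode the referee is worried about.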

minor comments (2)

[evaluation] The abstract states that 'reasoning accuracies remain unchanged within run-to-run noise margins'; the corresponding experimental section should report the exact number of runs, the precise definition of 'noise margins,' and the statistical test used.
[method] Notation for the logit-fusion step should be made fully explicit (e.g., the precise functional form of the fusion operator and any temperature or weighting hyperparameters) so that the method is reproducible from the text alone.
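One way to make the "within run-to-run noise" claim from the first minor comment checkable is to report the mean difference alongside its standard error over repeated runs. The sketch below does that with a simple k-standard-error rule. The rule, the default `k = 2.0`, and the accuracy values are illustrative placeholders, not the paper's protocol or results.

```python
import statistics

def within_noise(baseline_runs, method_runs, k=2.0):
    """Return (difference of means, its standard error, verdict).

    The verdict is True when |difference| <= k standard errors, using the
    unpooled (Welch-style) standard error of the difference of means.
    """
    diff = statistics.mean(method_runs) - statistics.mean(baseline_runs)
    se = (
        statistics.variance(baseline_runs) / len(baseline_runs)
        + statistics.variance(method_runs) / len(method_runs)
    ) ** 0.5
    return diff, se, abs(diff) <= k * se

# Placeholder accuracies (percent) from three seeds each; NOT paper data.
vanilla = [70.0, 71.0, 72.0]
with_method = [70.5, 71.5, 72.5]
print(within_noise(vanilla, with_method))
```

Reporting the number of seeds, the difference, and the standard error together would let a reader judge the claim without relying on an undefined "noise margin".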

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the theoretical underpinnings of COFT. The single major comment concerns the justification for the exchangeability assumption in the dual-branch split-conformal calibration. We address it point-by-point below and commit to revisions that strengthen the manuscript without misrepresenting the current claims.

Referee: The dual-branch split-conformal procedure (abstract and §3): the distribution-free marginal validity guarantees are stated to rest on the exchangeability of the factual and masked-counterfactual logit sequences. Replacing sensitive spans with neutral tokens induces a systematic distributional shift in context and next-token statistics; this shift is especially pronounced across multi-step CoT trajectories. No derivation, justification, or diagnostic (e.g., uniformity of conformity-score ranks on held-out data) is supplied to support that the two branches remain exchangeable. Without this, the coverage guarantees do not follow from the standard split-conformal theorem and the central theoretical claim is unsupported.

Authors: We appreciate the referee's precise identification of this gap. The manuscript conditions all coverage guarantees on the exchangeability assumption (abstract and §3) but does not derive why the specific masking strategy preserves it or supply supporting diagnostics. This is a legitimate weakness in the current presentation. In the revised manuscript we will add a new subsection in §3 that (i) articulates the design intent of neutral-token masking, namely to neutralize attribute-driven components of the next-token distribution while leaving task-relevant statistics largely intact; (ii) provides a brief argument that, when sensitive spans are the primary source of non-exchangeability, the resulting conformity-score sequences satisfy the required exchangeability for marginal validity; and (iii) reports empirical rank-uniformity diagnostics on held-out calibration sets across the evaluated models and benchmarks. These additions will make the theoretical claim explicit, auditable, and directly responsive to the concern without altering the method or reported results.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; validity guarantees invoke external conformal prediction theorem under stated assumption

The paper's central claim of distribution-free marginal validity rests on the standard split-conformal theorem applied to factual and masked-counterfactual logit sequences under an exchangeability assumption. This is an external mathematical result, not derived from or reduced to quantities fitted inside the paper. No equations show a prediction reducing to a fit by construction, no self-citation chain is load-bearing for the guarantee, and no ansatz or renaming is smuggled in. The bias-reduction results are presented as empirical evaluations across models and benchmarks. The derivation chain is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the exchangeability assumption for conformal validity and on the heuristic effectiveness of the logit fusion step; no new entities are postulated.

free parameters (1)

user-chosen risk level alpha
Controls the coverage guarantee of the conformal sets; chosen by the user rather than fitted to data.

axioms (1)

Domain assumption: exchangeability of factual and counterfactual token sequences or logits
Invoked to obtain distribution-free marginal validity guarantees for the per-step candidate token sets.
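For reference, the standard split-conformal statement that this axiom feeds into can be written as follows. The symbols ($s_1, \dots, s_n$ for calibration scores, $s_{n+1}$ for the score of a new example, $\hat q$ for the threshold) are notation introduced here, not taken from the paper.

```latex
\hat q \;=\; s_{\left(\lceil (n+1)(1-\alpha) \rceil\right)},
\qquad
\Pr\left[\, s_{n+1} \le \hat q \,\right] \;\ge\; 1-\alpha
\quad \text{if } s_1, \dots, s_n, s_{n+1} \text{ are exchangeable.}
```

Here $s_{(k)}$ denotes the $k$-th smallest calibration score, with $\hat q = +\infty$ when $\lceil (n+1)(1-\alpha) \rceil > n$. The guarantee is marginal over the draw of calibration and test data, and it says nothing once exchangeability between calibration scores and test-time scores fails.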

pith-pipeline@v0.9.1-grok · 5747 in / 1405 out tokens · 28317 ms · 2026-06-29T07:15:01.198880+00:00 · methodology
