Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels

Alexander Boesgaard Lorup (Openhagen)

arXiv: 2605.19092 · v1 · pith:H5TEFLKSnew · submitted 2026-05-18 · cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-20 12:59 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.CL

keywords: counterfactual likelihood · indirect influence · private reasoning channels · graph separation · role-visibility mask · negative log likelihood · public channel pathway · influence measurement

The pith

Graph separation controls produce bit-identical scores, showing public channels fully carry indirect private influence under role masks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a counterfactual likelihood test that replaces an upstream private block with a length-matched donor block, keeps the public token sequence and downstream target fixed, and measures the resulting negative-log-likelihood shift. This isolates indirect influence through public channels while controlling for positional confounds. On a 7B role-channel model, textual probes prove unreliable for detecting leakage, but the likelihood method cleanly separates masked and unmasked conditions and reveals asymmetric influence that persists from A to B through public hidden states but not in reverse. Validation across three checkpoints, five seeds, and 13,734 directional contrasts replicates the pattern. A graph-separation control that blocks private-to-public carrier edges yields identical natural and counterfactual scores in every case, establishing the public pathway as the complete carrier of the measured signal.

Core claim

Under the implemented role-visibility mask, the tested public-channel pathway is the complete carrier of the measured counterfactual signal. This is shown by a graph-separation control that blocks private-to-public carrier edges and produces bit-identical natural and counterfactual negative-log-likelihood scores across all 13,734 valid directional contrasts. The test replaces upstream private blocks with length-matched donor blocks to isolate influence on downstream targets while holding public sequences fixed.

What carries the argument

Counterfactual likelihood test that replaces an upstream private reasoning block with a length-matched donor block, holds the public token sequence and downstream target fixed, and measures the negative-log-likelihood shift on the target.
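
The test described above can be sketched in a few lines. This is a minimal illustration, not the author's implementation: `toy_log_prob` is a made-up stand-in for a real model's next-token log-probability, the function names are hypothetical, and the sign convention (counterfactual NLL minus natural NLL) is an assumption.

```python
import math
from typing import Callable, Sequence

# Hypothetical scorer interface: returns log p(token | prefix).
LogProbFn = Callable[[Sequence[str], str], float]


def target_nll(log_prob: LogProbFn, prefix: Sequence[str], target: Sequence[str]) -> float:
    """Sum of -log p(t_i | prefix, t_<i) over the target continuation."""
    seq = list(prefix)
    total = 0.0
    for tok in target:
        total -= log_prob(seq, tok)
        seq.append(tok)
    return total


def counterfactual_shift(log_prob, context, private, donor, public, target) -> float:
    """NLL(T | C_cf) - NLL(T | C_nat), with public tokens and target held fixed."""
    if len(donor) != len(private):
        raise ValueError("donor block must be length-matched to the original")
    c_nat = [*context, *private, *public]  # natural prefix
    c_cf = [*context, *donor, *public]     # counterfactual prefix: only the private block differs
    return target_nll(log_prob, c_cf, target) - target_nll(log_prob, c_nat, target)


def toy_log_prob(prefix, token):
    """Toy stand-in for a model: a token is likelier if it already appears in the prefix."""
    return math.log(0.6 if token in prefix else 0.1)


shift = counterfactual_shift(
    toy_log_prob,
    context=["task"],
    private=["secret", "plan"],
    donor=["other", "idea"],
    public=["hello"],
    target=["plan", "works"],
)
print(f"NLL shift: {shift:.4f}")
```

In this toy case the target reuses a token from the original private block, so swapping in the donor raises the target's NLL and the shift is positive. With a real model, `log_prob` would be replaced by a forward pass under the role-visibility mask.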

If this is right

Private-channel evaluation should report direct and indirect influence separately.
Counterfactual likelihood probes provide a practical default for measuring influence boundaries where textual methods fail.
Influence is asymmetric: A-to-B persists through public-speech hidden states while reverse B-to-A influence is near zero.
The public-channel pathway accounts for the complete counterfactual signal under the role-visibility mask.
The asymmetry and pathway identification replicate across multiple checkpoints and seeds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The test could be applied to audit information flow in other role-based or multi-agent systems that separate private and public computation.
Directional asymmetries may affect coordination in masked multi-role setups and warrant checks in larger models.
Focusing evaluation on public outputs alone may suffice once the carrier pathway is verified.

Load-bearing premise

Length matching of donor blocks sufficiently controls for RoPE positional encoding confounds when measuring the negative-log-likelihood shift on the downstream target.
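
A small sketch of why length matching matters for position: if the donor block has the same token count as the original, every public token keeps its absolute index, so RoPE sees the same positions in both prefixes. The helper names and token lists here are hypothetical, and the sketch covers only index preservation, not any residual effect on attention patterns.

```python
from typing import Optional, Sequence


def pick_length_matched_donor(
    original: Sequence[str], pool: Sequence[Sequence[str]]
) -> Optional[Sequence[str]]:
    """Return the first donor with the same token count as the original, else None."""
    for donor in pool:
        if len(donor) == len(original) and list(donor) != list(original):
            return donor
    return None


def public_positions(context, private, public):
    """Absolute position indices of the public tokens in [context, private, public]."""
    start = len(context) + len(private)
    return list(range(start, start + len(public)))


context = ["task", ":"]
original = ["a1", "a2", "a3"]
public = ["p1", "p2"]
pool = [["d1"], ["d1", "d2", "d3", "d4"], ["d1", "d2", "d3"]]

donor = pick_length_matched_donor(original, pool)
natural = public_positions(context, original, public)
matched = public_positions(context, donor, public)
unmatched = public_positions(context, pool[1], public)

print("natural:", natural)
print("matched donor:", matched)
print("unmatched donor:", unmatched)
```

The matched donor leaves the public positions unchanged, while the longer donor shifts them by one. That shift alone could change the target likelihood, which is the positional confound the premise is about.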

What would settle it

Observing different natural and counterfactual scores after blocking private-to-public carrier edges in the graph-separation control would show that the public-channel pathway does not fully account for the measured signal.
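
The logic of the graph-separation control can be illustrated with a toy computation graph. Everything below is an assumption made for illustration: the role names, the `ROLE_MASK` and `SEPARATED` edge sets, and the averaging "model" are invented and are not the paper's architecture. The point is only that once the private-to-public edge is removed, the target's score cannot depend on the private block, so natural and counterfactual scores compare equal exactly.

```python
def embed(token: str) -> float:
    """Deterministic toy embedding."""
    return sum(ord(c) for c in token) / 1000.0


def run(tokens, roles, visible):
    """Each position mixes its embedding with the mean state of earlier visible positions.

    `visible` is a set of (query_role, key_role) pairs that are allowed to connect.
    Returns the state at the final (target) position.
    """
    hidden = []
    for i, (tok, role) in enumerate(zip(tokens, roles)):
        seen = [hidden[j] for j in range(i) if (role, roles[j]) in visible]
        mix = sum(seen) / len(seen) if seen else 0.0
        hidden.append(embed(tok) + 0.5 * mix)
    return hidden[-1]


roles = ["ctx", "privA", "privA", "pubA", "tgtB"]

# Role-visibility mask: B's target never reads A's private tokens directly,
# but A's public token does (the private-to-public carrier edge).
ROLE_MASK = {
    ("privA", "ctx"), ("privA", "privA"),
    ("pubA", "ctx"), ("pubA", "privA"), ("pubA", "pubA"),
    ("tgtB", "ctx"), ("tgtB", "pubA"),
}
# Graph-separation control: additionally block the carrier edge.
SEPARATED = ROLE_MASK - {("pubA", "privA")}

natural = ["task", "secret", "plan", "hello", "reply"]
counterfactual = ["task", "other", "idea", "hello", "reply"]

masked_differs = run(natural, roles, ROLE_MASK) != run(counterfactual, roles, ROLE_MASK)
separated_identical = run(natural, roles, SEPARATED) == run(counterfactual, roles, SEPARATED)
print(masked_differs, separated_identical)
```

Under the mask alone, the private swap still reaches the target through the public token's state. Under separation, both runs execute the same arithmetic on the same inputs at the public and target positions, so exact float equality holds, mirroring the bit-identical outcome the control is designed to detect.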

Figures

Figures reproduced from arXiv: 2605.19092 by Alexander Boesgaard Lorup (Openhagen).

**Figure 1.** Counterfactual influence test design. Public tokens and the target continuation are held fixed; only the upstream private block is replaced.
Adjacent text (§3 Counterfactual Influence, truncated): Let X be an upstream private block, P the public token sequence that follows it, and T the downstream target continuation. Let X′ be a donor private block from another sample. We compare two prefixes: C_nat = [context, X, P] (1), C_cf = [cont…

**Figure 2.** Probe families differ sharply in diagnostic value. Raw 4-gram overlap was measured only on the masked variant.
Adjacent text (§4.2 Corrected n-gram overlap, truncated): The second probe subtracted n-grams already present in the prompt and prior public utterances. This moved in the right direction but remained too noisy. The unmasked baseline scored 64%, while the masked variant scored 54%. A 10-point gap is useful as a diagnostic but …
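
The corrected n-gram overlap probe described in the text accompanying Figure 2 can be sketched as follows. The per-sample score used here (the fraction of remaining private n-grams that reappear in the output) is an assumption; the source does not say how the reported percentages are aggregated. The function names and the bigram example are hypothetical.

```python
def ngrams(tokens, n=4):
    """Set of contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def corrected_overlap(private, output, prompt, prior_public, n=4):
    """Share of private n-grams reproduced in the output, after discarding
    n-grams the speaker could have copied from the prompt or earlier public text."""
    excluded = ngrams(prompt, n) | ngrams(prior_public, n)
    candidates = ngrams(private, n) - excluded
    if not candidates:
        return 0.0
    return len(candidates & ngrams(output, n)) / len(candidates)


private = "the key is blue door".split()
prompt = "find the key".split()
output = "the key is here".split()

# Raw overlap: no correction, so prompt-derived n-grams count as leakage.
raw = corrected_overlap(private, output, prompt=[], prior_public=[], n=2)
# Corrected overlap: ("the", "key") is excluded because the prompt already contains it.
corrected = corrected_overlap(private, output, prompt, prior_public=[], n=2)
print(raw, corrected)
```

The correction lowers the score because one matched bigram was already available from the prompt, which is the sense in which raw overlap overstates leakage.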

**Figure 3.** Directional influence after length matching on the masked variant. The reverse channel is near zero; A-to-B influence persists through public-speech hidden states.
Adjacent text (truncated): … private tokens. The likelihood shift can therefore measure position, not content. The symptom was an A-to-B influence rate of 22.2% despite blocked direct access from B to A’s private tokens. The top-delta examples had large donor/original lengt…

**Figure 4.** Directional influence rates across three checkpoints from the same training lineage. A-to-B influence exceeds B-to-A by a factor of 5.6 to 10.3 in every checkpoint; absolute magnitudes vary, but the asymmetric pattern is consistent.
Adjacent text (Clustering check, truncated): Because the directional contrasts are clustered by checkpoint, seed, trace, task, and donor pool, the Wilson 95% confidence intervals above treat contrasts as ind…
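
For readers unfamiliar with the Wilson 95% confidence intervals mentioned in the text accompanying Figure 4, here is a minimal sketch of the standard Wilson score interval for a binomial rate. The counts in the example are made up and are not taken from the paper. As the excerpt notes, the interval treats each contrast as an independent trial, so it does not account for clustering.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.959963984540054):
    """Wilson score interval for a binomial proportion (z defaults to the 95% level)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# Hypothetical counts for illustration only: 50 influenced contrasts out of 1000.
low, high = wilson_interval(50, 1000)
print(f"rate 0.050, 95% Wilson interval [{low:.4f}, {high:.4f}]")
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for rates near zero, which is relevant when the reverse-direction rate is close to zero.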

Original abstract

Reasoning systems increasingly separate intermediate computation into private and public channels, creating evaluation cases that look similar in transcripts: independent co-derivation, direct access to private content, and indirect influence through public communication. This paper presents a counterfactual likelihood test for measuring influence between private reasoning channels. The method replaces an upstream private block with a length-matched donor block, holds the public token sequence and downstream target fixed, and measures the downstream target's negative-log-likelihood shift. On a 7B role-channel reasoning model used for validation, textual probes are unreliable: raw n-gram overlap overstates leakage, corrected overlap remains noisy, and canary reproduction reports no discrimination. Counterfactual likelihood separates unmasked and masked conditions, while length matching controls a RoPE positional confound. In the hardened masked validation, reverse B-to-A influence is near zero, while A-to-B influence persists through public-speech hidden states. A multi-checkpoint validation across three checkpoints, five seeds, and 13,734 valid directional contrasts replicates this asymmetry. A graph-separation control that blocks private-to-public carrier edges produces bit-identical natural and counterfactual scores across all 13,734 control evaluations, identifying the tested public-channel pathway as the complete carrier of the measured counterfactual signal under the implemented role-visibility mask. The results show that private-channel evaluation should report direct and indirect influence separately, and that counterfactual likelihood probes provide a practical default for measuring these boundaries.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The counterfactual replacement test plus graph control gives a cleaner read on indirect influence than text probes, though the RoPE length-matching step still looks like the weakest link.

This paper introduces a counterfactual likelihood test to measure indirect influence between private reasoning channels in models that separate computation. The core move is swapping an upstream private block for a length-matched donor while keeping the public tokens and downstream target the same, then tracking the shift in negative log likelihood. On a 7B role-channel model it separates unmasked and masked conditions where raw overlap and canary checks do not. The graph-separation control that blocks private-to-public edges produces bit-identical natural and counterfactual scores across all 13,734 cases, which is the strongest piece of evidence they present. It also shows clear directional asymmetry, with A-to-B influence persisting through public hidden states while reverse influence stays near zero, and they replicate across checkpoints and seeds. That combination of replacement, NLL shift, and explicit separation validation is what is new here and where the work is most useful.

The length-matching step is offered as a control for RoPE positional confounds, but that assumption is not obviously airtight. Swapping a donor block into a fixed public sequence could still shift absolute or relative positions enough to affect attention patterns and downstream likelihoods beyond pure content influence. Without more explicit checks on attention or position encodings in the results, some of the measured asymmetry might carry residual positional leakage.

The work is aimed at people auditing information flow or privacy boundaries in separated reasoning systems. Anyone testing channel isolation for safety or evaluation purposes would get practical value from trying the probe. I would send it for peer review so the methods section and any additional positional diagnostics get proper scrutiny.

Referee Report

2 major / 2 minor

Summary. The paper introduces a counterfactual likelihood test for measuring indirect influence between private reasoning channels in LLMs. It replaces an upstream private block with a length-matched donor block, holds the public token sequence and downstream target fixed, and measures the resulting shift in negative log-likelihood on the target. Validation on a 7B role-channel model shows textual probes (n-gram overlap, canary reproduction) are unreliable, while the likelihood test separates unmasked and masked conditions and detects asymmetric influence (persistent A-to-B via public hidden states, near-zero B-to-A). A graph-separation control blocking private-to-public edges yields bit-identical natural and counterfactual scores across all 13,734 contrasts, identifying the public-channel pathway as the complete carrier under the role-visibility mask. The asymmetry replicates across three checkpoints, five seeds, and the full set of directional contrasts.

Significance. If the central claims hold, the work supplies a practical, falsifiable method for auditing information flow across private-public boundaries in reasoning models and demonstrates that private-channel evaluation must separately report direct and indirect influence. Notable strengths include the bit-identical graph-separation control across 13k contrasts and the multi-checkpoint, multi-seed replication, both of which support reproducibility.

major comments (2)

§4 (Counterfactual Likelihood Test and Length Matching): the claim that length matching of donor blocks controls RoPE positional confounds lacks an explicit verification (e.g., ablation confirming preserved relative positional relationships or unchanged attention patterns beyond token length). This is load-bearing for attributing observed NLL shifts and the reported A-to-B asymmetry solely to content influence rather than residual positional leakage.
§5 (Graph-Separation Control): while bit-identical scores across 13,734 evaluations are reported, the manuscript does not test whether the result remains stable under modest changes to the role-visibility mask; dependence on the specific mask implementation weakens the generality of the 'complete carrier' conclusion.

minor comments (2)

Abstract: the phrase 'hardened masked validation' is used without a forward reference or brief definition; add a parenthetical pointer to the relevant subsection.
Notation: ensure 'natural' and 'counterfactual' scores are defined once at first use and used consistently thereafter to avoid reader ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make.

Referee: §4 (Counterfactual Likelihood Test and Length Matching): the claim that length matching of donor blocks controls RoPE positional confounds lacks an explicit verification (e.g., ablation confirming preserved relative positional relationships or unchanged attention patterns beyond token length). This is load-bearing for attributing observed NLL shifts and the reported A-to-B asymmetry solely to content influence rather than residual positional leakage.

Authors: We appreciate the referee highlighting the need for explicit verification. Length matching ensures donor blocks occupy identical sequence positions to the originals, preserving the positional indices applied to the fixed public token sequence under RoPE. To provide the requested confirmation, we will add an ablation in the revised manuscript that compares attention head patterns and NLL shifts between length-matched and non-matched donor blocks. This will isolate content-driven effects from any residual positional contributions and directly support attribution of the A-to-B asymmetry. revision: yes

Referee: §5 (Graph-Separation Control): while bit-identical scores across 13,734 evaluations are reported, the manuscript does not test whether the result remains stable under modest changes to the role-visibility mask; dependence on the specific mask implementation weakens the generality of the 'complete carrier' conclusion.

Authors: We agree that the 'complete carrier' conclusion is scoped to the specific role-visibility mask implemented in the model. The bit-identical natural and counterfactual scores across all 13,734 contrasts rigorously demonstrate that, under this mask, the public-channel pathway accounts for the entire measured signal with no residual private-to-public leakage. We did not vary the mask, as the experiment was designed to validate the counterfactual test within the model's fixed architecture. In revision we will expand the discussion of the mask definition, explicitly state the scope of the claim, and note the desirability of mask-variation tests as future work. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation relies on explicit interventions and empirical measurements

The paper's method replaces an upstream private block with a length-matched donor while holding the public sequence fixed, then applies a graph-separation control that blocks private-to-public edges. The central claim follows from the observed bit-identical natural and counterfactual scores across 13,734 evaluations, which directly indicates that the measured NLL shift is carried exclusively by the public pathway under the role-visibility mask. This is an empirical outcome of the intervention rather than a quantity defined in terms of itself or a fitted parameter renamed as a prediction. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the result; the length-matching control is stated as an experimental assumption for RoPE confounds but does not create a definitional loop. The derivation chain from test construction to conclusion about complete carrier status remains self-contained and falsifiable via the reported contrasts.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard transformer likelihood computation and the assumption that role-visibility masks cleanly separate channels; no new physical constants or ad-hoc fitted scales are introduced in the abstract.

axioms (1)

Domain assumption: Role-visibility masks cleanly isolate private from public token streams in the evaluated model.
Invoked when interpreting the graph-separation control and the persistence of A-to-B influence through public hidden states.

pith-pipeline@v0.9.0 · 5783 in / 1232 out tokens · 28730 ms · 2026-05-20T12:59:14.225147+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: "replaces an upstream private block with a donor block, holds the public token sequence and downstream target fixed, and measures the downstream target’s negative-log-likelihood shift"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · relation: unclear
Paper passage: "graph-separation control that blocks private-to-public carrier edges produces bit-identical natural and counterfactual scores"

The relation tag describes the link between the paper passage and the cited Recognition theorem.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages · 10 internal anchors

[1] Du, Y., Li, S., Torralba, A., Tenenbaum, J. B., and Mordatch, I. (2023). Improving Factuality and Reasoning in Language Models through Multiagent Debate. arXiv:2305.14325

[2] Irving, G., Christiano, P., and Amodei, D. (2018). AI Safety via Debate. arXiv:1805.00899

[3] Korbak, T., Balesni, M., Barnes, E., Bengio, Y., et al. (2025). Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety. arXiv:2507.11473

[4] Lanham, T., Chen, A., Radhakrishnan, A., Steiner, B., et al. (2023). Measuring Faithfulness in Chain-of-Thought Reasoning. arXiv:2307.13702

[5] Meng, K., Bau, D., Andonian, A., and Belinkov, Y. (2022). Locating and Editing Factual Associations in GPT. arXiv:2202.05262

[6] Su, G., Yang, Y., Li, X., and Geiping, J. (2026). Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs. arXiv:2605.12460

[7] Turpin, M., Michael, J., Perez, E., and Bowman, S. R. (2023). Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. arXiv:2305.04388

[8] Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Sakenis, S., Huang, J., Singer, Y., and Shieber, S. (2020). Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias. arXiv:2004.12265

[9] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171

[10] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903

[11] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601
