Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels
Pith reviewed 2026-05-20 12:59 UTC · model grok-4.3
The pith
A graph-separation control produces bit-identical scores, showing that the public channel fully carries the measured indirect private influence under the implemented role-visibility mask.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the implemented role-visibility mask, the tested public-channel pathway is the complete carrier of the measured counterfactual signal. This is shown by a graph-separation control that blocks private-to-public carrier edges and produces bit-identical natural and counterfactual negative-log-likelihood scores across all 13,734 valid directional contrasts. The test replaces upstream private blocks with length-matched donor blocks to isolate influence on downstream targets while holding public sequences fixed.
What carries the argument
Counterfactual likelihood test that replaces an upstream private reasoning block with a length-matched donor block, holds the public token sequence and downstream target fixed, and measures the negative-log-likelihood shift on the target.
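The test described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `NllFn`, `counterfactual_nll_shift`, and `toy_nll` are hypothetical, the toy scorer stands in for a real model's negative log-likelihood, and the sign convention (counterfactual minus natural) is chosen here for illustration.

```python
from typing import Callable, Sequence

# Hypothetical scorer signature: (context tokens, target tokens) -> NLL of the target.
NllFn = Callable[[Sequence[str], Sequence[str]], float]


def counterfactual_nll_shift(
    nll_fn: NllFn,
    private_block: Sequence[str],
    donor_block: Sequence[str],
    public_tokens: Sequence[str],
    target_tokens: Sequence[str],
) -> float:
    """Return counterfactual NLL minus natural NLL on the fixed target."""
    # Length matching keeps every later token at the same position index.
    if len(donor_block) != len(private_block):
        raise ValueError("donor block must be length-matched to the private block")
    # Natural run: original private block, then the fixed public tokens.
    natural = nll_fn(list(private_block) + list(public_tokens), target_tokens)
    # Counterfactual run: donor block swapped in, public tokens and target unchanged.
    counterfactual = nll_fn(list(donor_block) + list(public_tokens), target_tokens)
    return counterfactual - natural


def toy_nll(context: Sequence[str], target: Sequence[str]) -> float:
    """Toy stand-in for a model: each target token absent from the context costs 1.0."""
    seen = set(context)
    return float(sum(1 for tok in target if tok not in seen))
```

With a real model, `nll_fn` would wrap a forward pass that scores the target tokens given the context; a shift near zero would then suggest the upstream private block has little measurable influence on the target.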
If this is right
- Private-channel evaluation should report direct and indirect influence separately.
- Counterfactual likelihood probes provide a practical default for measuring influence boundaries where textual methods fail.
- Influence is asymmetric: A-to-B persists through public-speech hidden states while reverse B-to-A influence is near zero.
- The public-channel pathway accounts for the complete counterfactual signal under the role-visibility mask.
- The asymmetry and pathway identification replicate across multiple checkpoints and seeds.
Where Pith is reading between the lines
- The test could be applied to audit information flow in other role-based or multi-agent systems that separate private and public computation.
- Directional asymmetries may affect coordination in masked multi-role setups and warrant checks in larger models.
- Focusing evaluation on public outputs alone may suffice once the carrier pathway is verified.
Load-bearing premise
Length matching of donor blocks sufficiently controls for RoPE positional encoding confounds when measuring the negative-log-likelihood shift on the downstream target.
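To make the premise concrete, the sketch below shows what length matching holds fixed: the absolute position indices of the public and target tokens. It assumes a simple contiguous layout (upstream block, then public tokens, then target), which is an assumption made here and not a detail confirmed by the paper; `downstream_positions` is a hypothetical helper. It does not show that identical indices are sufficient to remove every positional effect, which is exactly what the premise takes for granted.

```python
from typing import List, Tuple


def downstream_positions(
    upstream_len: int, public_len: int, target_len: int
) -> Tuple[List[int], List[int]]:
    """Absolute position indices of public and target tokens after an upstream block.

    Assumed layout: [upstream block][public tokens][target tokens], contiguous.
    """
    public = list(range(upstream_len, upstream_len + public_len))
    target_start = upstream_len + public_len
    target = list(range(target_start, target_start + target_len))
    return public, target


# Original private block of 4 tokens, 3 public tokens, 2 target tokens.
original = downstream_positions(4, 3, 2)
# Length-matched donor (4 tokens): downstream indices are unchanged.
matched = downstream_positions(4, 3, 2)
# Shorter donor (2 tokens): every downstream index shifts by 2.
unmatched = downstream_positions(2, 3, 2)
```

Under rotary position embeddings, each token's rotation is a function of its position index, so an unmatched donor changes the positional input to every downstream token in addition to changing the content.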
What would settle it
Observing different natural and counterfactual scores after blocking private-to-public carrier edges in the graph-separation control would show that the public-channel pathway does not fully account for the measured signal.
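The settling check can be expressed as a bitwise comparison of paired scores. The sketch below assumes scores are available as Python floats (IEEE 754 doubles); the paper's actual numeric precision and storage format are not stated in this summary, and `float_bits` and `count_bit_mismatches` are hypothetical helper names.

```python
import struct
from typing import Iterable, Tuple


def float_bits(x: float) -> bytes:
    """IEEE 754 double-precision bytes of x (big-endian)."""
    return struct.pack(">d", x)


def count_bit_mismatches(pairs: Iterable[Tuple[float, float]]) -> int:
    """Count (natural, counterfactual) pairs whose bit patterns differ."""
    return sum(1 for nat, cf in pairs if float_bits(nat) != float_bits(cf))
```

Bitwise equality is stricter than `==` or a tolerance check: `0.0` and `-0.0` compare equal numerically but differ in their bits, so a count of zero across all control evaluations is a strong form of agreement. Any nonzero count would be the kind of evidence described above.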
Original abstract
Reasoning systems increasingly separate intermediate computation into private and public channels, creating evaluation cases that look similar in transcripts: independent co-derivation, direct access to private content, and indirect influence through public communication. This paper presents a counterfactual likelihood test for measuring influence between private reasoning channels. The method replaces an upstream private block with a length-matched donor block, holds the public token sequence and downstream target fixed, and measures the downstream target's negative-log-likelihood shift. On a 7B role-channel reasoning model used for validation, textual probes are unreliable: raw n-gram overlap overstates leakage, corrected overlap remains noisy, and canary reproduction reports no discrimination. Counterfactual likelihood separates unmasked and masked conditions, while length matching controls a RoPE positional confound. In the hardened masked validation, reverse B-to-A influence is near zero, while A-to-B influence persists through public-speech hidden states. A multi-checkpoint validation across three checkpoints, five seeds, and 13,734 valid directional contrasts replicates this asymmetry. A graph-separation control that blocks private-to-public carrier edges produces bit-identical natural and counterfactual scores across all 13,734 control evaluations, identifying the tested public-channel pathway as the complete carrier of the measured counterfactual signal under the implemented role-visibility mask. The results show that private-channel evaluation should report direct and indirect influence separately, and that counterfactual likelihood probes provide a practical default for measuring these boundaries.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a counterfactual likelihood test for measuring indirect influence between private reasoning channels in LLMs. It replaces an upstream private block with a length-matched donor block, holds the public token sequence and downstream target fixed, and measures the resulting shift in negative log-likelihood on the target. Validation on a 7B role-channel model shows textual probes (n-gram overlap, canary reproduction) are unreliable, while the likelihood test separates unmasked and masked conditions and detects asymmetric influence (persistent A-to-B via public hidden states, near-zero B-to-A). A graph-separation control blocking private-to-public edges yields bit-identical natural and counterfactual scores across all 13,734 contrasts, identifying the public-channel pathway as the complete carrier under the role-visibility mask. The asymmetry replicates across three checkpoints, five seeds, and the full set of directional contrasts.
Significance. If the central claims hold, the work supplies a practical, falsifiable method for auditing information flow across private-public boundaries in reasoning models and demonstrates that private-channel evaluation must separately report direct and indirect influence. Notable strengths include the bit-identical graph-separation control across 13k contrasts and the multi-checkpoint, multi-seed replication, both of which support reproducibility.
major comments (2)
- [§4 (Counterfactual Likelihood Test and Length Matching)] The claim that length matching of donor blocks controls RoPE positional confounds lacks an explicit verification (e.g., ablation confirming preserved relative positional relationships or unchanged attention patterns beyond token length). This is load-bearing for attributing observed NLL shifts and the reported A-to-B asymmetry solely to content influence rather than residual positional leakage.
- [§5 (Graph-Separation Control)] While bit-identical scores across 13,734 evaluations are reported, the manuscript does not test whether the result remains stable under modest changes to the role-visibility mask; dependence on the specific mask implementation weakens the generality of the 'complete carrier' conclusion.
minor comments (2)
- [Abstract] The phrase 'hardened masked validation' is used without a forward reference or brief definition; add a parenthetical pointer to the relevant subsection.
- [Notation] Ensure 'natural' and 'counterfactual' scores are defined once at first use and used consistently thereafter to avoid reader ambiguity.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make.
- Referee: [§4 (Counterfactual Likelihood Test and Length Matching)] The claim that length matching of donor blocks controls RoPE positional confounds lacks an explicit verification (e.g., ablation confirming preserved relative positional relationships or unchanged attention patterns beyond token length). This is load-bearing for attributing observed NLL shifts and the reported A-to-B asymmetry solely to content influence rather than residual positional leakage.
  Authors: We appreciate the referee highlighting the need for explicit verification. Length matching ensures donor blocks occupy identical sequence positions to the originals, preserving the positional indices applied to the fixed public token sequence under RoPE. To provide the requested confirmation, we will add an ablation in the revised manuscript that compares attention head patterns and NLL shifts between length-matched and non-matched donor blocks. This will isolate content-driven effects from any residual positional contributions and directly support attribution of the A-to-B asymmetry.
  Revision: yes
- Referee: [§5 (Graph-Separation Control)] While bit-identical scores across 13,734 evaluations are reported, the manuscript does not test whether the result remains stable under modest changes to the role-visibility mask; dependence on the specific mask implementation weakens the generality of the 'complete carrier' conclusion.
  Authors: We agree that the 'complete carrier' conclusion is scoped to the specific role-visibility mask implemented in the model. The bit-identical natural and counterfactual scores across all 13,734 contrasts rigorously demonstrate that, under this mask, the public-channel pathway accounts for the entire measured signal with no residual private-to-public leakage. We did not vary the mask, as the experiment was designed to validate the counterfactual test within the model's fixed architecture. In revision we will expand the discussion of the mask definition, explicitly state the scope of the claim, and note the desirability of mask-variation tests as future work.
  Revision: partial
Circularity Check
No significant circularity; derivation relies on explicit interventions and empirical measurements
The paper's method replaces an upstream private block with a length-matched donor while holding the public sequence fixed, then applies a graph-separation control that blocks private-to-public edges. The central claim follows from the observed bit-identical natural and counterfactual scores across 13,734 evaluations, which directly indicates that the measured NLL shift is carried exclusively by the public pathway under the role-visibility mask. This is an empirical outcome of the intervention rather than a quantity defined in terms of itself or a fitted parameter renamed as a prediction. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the result; the length-matching control is stated as an experimental assumption for RoPE confounds but does not create a definitional loop. The derivation chain from test construction to conclusion about complete carrier status remains self-contained and falsifiable via the reported contrasts.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Role-visibility masks cleanly isolate private from public token streams in the evaluated model.
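A role-visibility mask of the kind this assumption refers to might look like the sketch below. The paper's actual mask definition is not given in this summary, so the rule used here (causal attention; public tokens visible to every role; private tokens visible only to their own role) is an assumption, and `Token` and `role_visibility_mask` are hypothetical names.

```python
from typing import List, Tuple

# Hypothetical token annotation: (role, channel), channel is "private" or "public".
Token = Tuple[str, str]


def role_visibility_mask(tokens: List[Token]) -> List[List[bool]]:
    """mask[i][j] is True when query token i may attend to key token j."""
    n = len(tokens)
    mask = [[False] * n for _ in range(n)]
    for i, (q_role, _) in enumerate(tokens):
        for j in range(i + 1):  # causal: only the current and earlier tokens
            k_role, k_channel = tokens[j]
            # Public tokens are visible to every role;
            # private tokens are visible only to their own role.
            mask[i][j] = k_channel == "public" or k_role == q_role
    return mask


# Toy layout: A thinks privately, A speaks, B thinks privately, B speaks.
tokens = [("A", "private"), ("A", "public"), ("B", "private"), ("B", "public")]
mask = role_visibility_mask(tokens)
```

In this toy layout, B never attends directly to A's private token, but A's public token does attend to A's own private token. That edge is the kind of private-to-public carrier edge the graph-separation control is described as blocking, and it is how private content could reach B indirectly.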
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "replaces an upstream private block with a donor block, holds the public token sequence and downstream target fixed, and measures the downstream target’s negative-log-likelihood shift"
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "graph-separation control that blocks private-to-public carrier edges produces bit-identical natural and counterfactual scores"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.