Neural Estimation of Pairwise Mutual Information in Masked Discrete Sequence Models

Bryan Li; Jai Sharma; Yifan Wang

arXiv: 2605.20187 · v1 · pith:ZQDYVRYVnew · submitted 2026-01-27 · cs.LG · cs.AI · cs.IT · math.IT

Pith reviewed 2026-05-21 14:01 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.IT · math.IT

keywords mutual information estimation · masked diffusion models · pairwise dependencies · parallel decoding · sequence generation · hidden state analysis · self-supervised MI

The pith

A neural network estimates the full pairwise mutual information matrix from hidden states of a masked diffusion model in one forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a neural framework that trains on hidden states of a pretrained masked diffusion model using the model's own computed pairwise MI values as supervision. This produces an estimator that outputs the entire MI matrix without needing to query every pair of conditional distributions at inference time. The approach lets the model reveal its learned beliefs about which variables depend on each other. In practice this identifies conditionally independent groups that can be decoded in parallel. Experiments on Sudoku and protein sequence tasks show the resulting MI maps match known structures and cut the number of required forward passes by a factor of three to five while keeping sample quality intact.

Core claim

We propose a neural framework for estimating pairwise conditional mutual information directly from the hidden states of a pretrained MDM, using ground-truth MI computed from the model's own conditional distributions for supervision. The resulting estimator captures the model's internal belief about dependency structure and predicts the full MI matrix in a single forward pass, enabling MI-guided parallel decoding by identifying conditionally independent subsets of variables.
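As a minimal sketch of how a supervision label of this kind could be computed for one pair of positions, the snippet below builds the pairwise joint by the chain rule and takes its KL divergence from the product of marginals. The function name `pairwise_mi` and its inputs are illustrative assumptions, not the paper's code; it also assumes the conditionals `p(x_j | x_i = a, context)` have already been obtained from the model, one query per candidate value `a`.

```python
import numpy as np

def pairwise_mi(p_i, p_j_given_i, eps=1e-12):
    """Mutual information (nats) between X_i and X_j under a fixed context.

    p_i:          shape (V,), the conditional p(x_i | context).
    p_j_given_i:  shape (V, V), row a holds p(x_j | x_i = a, context).
    Both inputs are assumed to come from the pretrained model (hypothetical setup).
    """
    p_i = np.asarray(p_i, dtype=float)
    p_j_given_i = np.asarray(p_j_given_i, dtype=float)
    joint = p_i[:, None] * p_j_given_i       # chain rule: p(x_i, x_j | context)
    p_j = joint.sum(axis=0)                  # marginal of x_j implied by the joint
    product = p_i[:, None] * p_j[None, :]    # what independence would predict
    mask = joint > eps                       # treat 0 * log 0 as 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))
```

Note that the marginal of `x_j` is taken from the constructed joint; a real model's directly predicted `p(x_j | context)` may differ slightly if its conditionals are not perfectly consistent.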

What carries the argument

A neural estimator that takes the hidden states of the masked diffusion model as input and directly outputs the full pairwise mutual information matrix, trained with self-computed MI labels from the model's conditional distributions.
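The page does not describe the estimator's architecture, so the following is only a hypothetical illustration of the input and output shapes involved: a low-rank bilinear head that maps hidden states of shape `(L, d)` to a symmetric, non-negative `(L, L)` matrix in one pass, with a mean-squared-error loss against precomputed MI labels. The names `mi_head`, `w_q`, `w_k`, and `mse_loss` are assumptions made for this sketch.

```python
import numpy as np

def mi_head(hidden, w_q, w_k):
    """Hypothetical head: hidden states (L, d) -> predicted MI matrix (L, L)."""
    q = hidden @ w_q                        # (L, r) projection
    k = hidden @ w_k                        # (L, r) projection
    scores = q @ k.T                        # (L, L) pairwise scores
    scores = 0.5 * (scores + scores.T)      # MI is symmetric in (i, j)
    mi = np.logaddexp(0.0, scores)          # softplus keeps predictions >= 0
    np.fill_diagonal(mi, 0.0)               # self-pairs are not predicted
    return mi

def mse_loss(pred, target):
    """Mean squared error over off-diagonal entries only."""
    off_diag = ~np.eye(pred.shape[0], dtype=bool)
    return float(np.mean((pred[off_diag] - target[off_diag]) ** 2))
```

Symmetrizing the scores and applying a softplus are design choices made here to respect two basic properties of MI (symmetry and non-negativity); the paper may enforce them differently or not at all.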

Load-bearing premise

The hidden states of the pretrained masked diffusion model contain enough information about pairwise conditional dependencies that a neural network can recover an accurate MI matrix without direct access to all conditional distributions.

What would settle it

Directly compute the true pairwise MI matrix from the model's full set of conditional distributions on the same inputs and compare it to the neural estimator output; large mismatch or failure of the estimated MI to produce valid parallel decoding without quality loss would falsify the claim.
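One way such a comparison could be scored is sketched below: the mean absolute error and Pearson correlation between the exact and estimated matrices over distinct pairs. The function name `compare_mi` and the choice of these two metrics are assumptions for illustration; the paper may report different agreement measures.

```python
import numpy as np

def compare_mi(true_mi, est_mi):
    """Agreement between two symmetric (L, L) MI matrices over pairs i < j."""
    true_mi = np.asarray(true_mi, dtype=float)
    est_mi = np.asarray(est_mi, dtype=float)
    upper = np.triu_indices(true_mi.shape[0], k=1)   # each pair counted once
    t, e = true_mi[upper], est_mi[upper]
    mae = float(np.mean(np.abs(t - e)))              # average absolute gap in nats
    corr = float(np.corrcoef(t, e)[0, 1])            # undefined if either is constant
    return mae, corr
```

A low error with high correlation would support the estimator; the downstream check on sample quality under parallel decoding is a separate test that this snippet does not cover.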

Figures

Figures reproduced from arXiv: 2605.20187 by Bryan Li, Jai Sharma, Yifan Wang.

**Figure 1.** Visualization of the MDM’s beliefs before unmasking an 8 in the boxed square. Top right: ‘1’ must lie in two cells of the bottom row, resulting in high MI (0.44 nats) …

… conditionally independent given the context. To achieve this goal, we must move beyond the marginal predictions. The mutual information between positions x_i and x_j is the KL divergence between the joint distribution p(x_i, x_j) and the produc…

**Figure 2.** Sudoku, accuracy vs. NFE.

5.3. Protein Sequences: MI-guided Parallelization. Using a variety of techniques, we unconditionally generated 500 random proteins of lengths 50-100. We also embedded 500 reference samples from the UniRef50 database, of length 50-100. We then clustered them with k-means (fit on the reference, k=15), and computed their Jensen-Shannon divergence to the reference set.
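The cluster-based comparison described for the protein experiment can be sketched as follows, assuming the k-means cluster labels for generated and reference embeddings are already available. The helper names are hypothetical, and the log base (natural log here) and any further normalization used in the paper are not stated on this page.

```python
import numpy as np

def cluster_histogram(labels, k):
    """Fraction of samples assigned to each of k clusters."""
    counts = np.bincount(np.asarray(labels), minlength=k).astype(float)
    return counts / counts.sum()

def jensen_shannon_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (nats) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)                       # mixture of the two histograms

    def kl(a, b):
        mask = a > eps                      # treat 0 * log 0 as 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

With natural logarithms the divergence lies between 0 (identical cluster usage) and log 2 (disjoint cluster usage), so lower values mean the generated set occupies the reference clusters in similar proportions.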

**Figure 3.** PCA projections of ESM-C embeddings of various sampling methods.

6. Discussion. Mutual information guided parallelization significantly reduces average forward passes in generation in masked diffusion models. Although this speedup generally comes with a degradation in the quality of generated samples, our experiments show a substantial improvement beyond naive parallelization, achieving efficiency gains …

Original abstract

Understanding dependencies between variables is critical for interpretability and efficient generation in masked diffusion models (MDMs), yet these models primarily expose marginal conditional distributions and do not explicitly represent inter-variable dependence. We propose a neural framework for estimating pairwise conditional mutual information (MI) directly from the hidden states of a pretrained MDM, using ground-truth MI computed from the model's own conditional distributions for supervision. The resulting estimator captures the model's internal belief about dependency structure and predicts the full MI matrix in a single forward pass, enabling MI-guided parallel decoding by identifying conditionally independent subsets of variables. We evaluate our approach on Sudoku and protein sequence generation with ESM-C, where the MI maps recover known structural constraints and enable a 3-5x magnitude reduction in inference-time forward passes compared to sequential decoding, while preserving generative quality and outperforming entropy-based parallelization methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The neural MI estimator from MDM hidden states is a clean engineering move for parallel decoding, but the pairwise independence assumption needs a direct check against higher-order effects.

The paper trains a small network to read pairwise conditional mutual information straight out of the hidden states of a pretrained masked diffusion model. Supervision comes from MI values the base model can compute itself from its conditionals, so the estimator learns to approximate that information without running all the conditionals at inference time. Once trained, it outputs the full MI matrix in one pass and uses low-MI pairs to decide which tokens can be sampled together, cutting the number of forward passes by 3-5x on the Sudoku and ESM-C tasks while keeping sample quality comparable to sequential decoding and better than entropy baselines.
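The step of turning an MI matrix into a set of tokens to sample together can be illustrated with a simple greedy rule. This is a sketch under assumptions, not the paper's procedure: the function name, the fixed threshold, and the use of the order of `masked` as a priority order are all hypothetical.

```python
import numpy as np

def select_parallel_set(mi, masked, threshold):
    """Greedily collect masked positions whose pairwise MI with every
    position already chosen is below `threshold`.

    mi:        (L, L) estimated pairwise MI matrix.
    masked:    iterable of still-masked positions, in priority order.
    threshold: MI value (nats) below which two positions are treated as
               approximately independent given the context.
    """
    chosen = []
    for pos in masked:
        if all(mi[pos, other] < threshold for other in chosen):
            chosen.append(pos)
    return chosen

# Example: positions 0 and 1 are strongly coupled, so only one of them is taken.
mi = np.zeros((4, 4))
mi[0, 1] = mi[1, 0] = 0.44
mi[2, 3] = mi[3, 2] = 0.01
print(select_parallel_set(mi, [0, 1, 2, 3], threshold=0.05))  # [0, 2, 3]
```

All positions in the returned set would be sampled from their individual conditionals in the same step, which is where the saving in forward passes comes from.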

Referee Report

1 major / 2 minor

Summary. The paper proposes a neural network estimator that predicts the full matrix of pairwise conditional mutual information directly from the hidden states of a pretrained masked diffusion model. The estimator is trained using ground-truth pairwise MI values computed from the model's own conditional distributions as supervision. The resulting MI predictions are used to identify subsets of variables that can be sampled in parallel during decoding, yielding a claimed 3-5x reduction in the number of forward passes on Sudoku and ESM-C protein sequence tasks while recovering known structural constraints and preserving generative quality relative to sequential and entropy-based baselines.

Significance. If the estimator is accurate and the parallelization rule introduces no distributional bias, the work would provide both a practical tool for accelerating MDM inference and a window into the dependency structure learned by these models. The self-supervised training regime that re-uses the base model's conditionals is a notable strength, as is the empirical demonstration that the recovered MI maps align with domain-specific constraints in Sudoku and protein sequences.

major comments (1)

[Abstract; MI-guided parallel decoding procedure] The central efficiency claim rests on the assertion that thresholding the estimated pairwise conditional MI identifies subsets whose joint conditional distribution factorizes. In a general discrete distribution, however, pairwise conditional independence does not imply joint conditional independence for |S| > 2; higher-order dependencies may remain. The reported experiments recover known pairwise structural constraints but do not include a direct diagnostic (KL or total-variation distance between the joint p(X_S | context) and the product of the individual conditionals) on the subsets actually selected for parallel steps. This verification is load-bearing for the claim that generative quality is preserved under the reported speedups.
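A standard textbook counterexample makes the gap between pairwise and joint independence concrete. It is not taken from the paper; it only illustrates why zero pairwise MI on a selected subset does not by itself certify that the joint factorizes.

```latex
X_1, X_2 \sim \mathrm{Bernoulli}(1/2) \ \text{independent}, \qquad X_3 = X_1 \oplus X_2 .

% Every pair is independent:
I(X_1; X_2) = I(X_1; X_3) = I(X_2; X_3) = 0 .

% The joint puts mass 1/4 on each of 4 outcomes, while the product of
% marginals puts mass 1/8 on each of 8 outcomes:
\mathrm{KL}\!\left( p(x_1, x_2, x_3) \,\middle\|\, p(x_1)\, p(x_2)\, p(x_3) \right)
  = 4 \cdot \tfrac{1}{4} \log \frac{1/4}{1/8} = \log 2 .
```

Sampling all three variables independently from their marginals would therefore produce an invalid triple half of the time, even though every entry of the pairwise MI matrix is zero.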

minor comments (2)

[Abstract] The abstract states that the estimator yields 3-5x speedups but does not specify the precise baseline (sequential vs. entropy-based), the exact number of forward passes measured, or the quantitative quality metrics used to assert preservation of generative quality.
[Method] Additional detail on the neural estimator architecture, loss function, and training hyper-parameters would strengthen reproducibility, even if the full text contains them.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the major comment regarding the justification for joint conditional independence from pairwise MI thresholding below, and we outline the changes we will make to strengthen the empirical support for our claims.

Referee: The central efficiency claim rests on the assertion that thresholding the estimated pairwise conditional MI identifies subsets whose joint conditional distribution factorizes (Abstract and the MI-guided parallel decoding procedure). In a general discrete distribution, however, pairwise conditional independence does not imply joint conditional independence for |S| > 2; higher-order dependencies may remain. The reported experiments recover known pairwise structural constraints but do not include a direct diagnostic (KL or total-variation distance between the joint p(X_S | context) and the product of the individual conditionals) on the subsets actually selected for parallel steps. This verification is load-bearing for the claim that generative quality is preserved under the reported speedups.

Authors: We agree that, in general discrete distributions, pairwise conditional independence does not guarantee joint conditional independence for sets of size greater than two, and that higher-order dependencies could in principle remain. Our current experiments demonstrate that task-level generative quality is preserved under MI-guided parallelization (Sudoku accuracy and protein perplexity) and that the recovered MI maps align with known structural constraints, but these results do not directly quantify the factorization error on the specific subsets chosen at each decoding step. To address this point, we will add a direct diagnostic in the revised manuscript: for a representative sample of decoding trajectories, we will compute the KL divergence and total-variation distance between the model's joint conditional p(X_S | context) and the product of the individual conditionals over the MI-selected subsets. These quantities will be reported alongside the existing speed and quality metrics for both the Sudoku and ESM-C tasks, allowing readers to assess the practical validity of the factorization assumption at the operating thresholds we employ.

revision: yes
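A diagnostic of the kind described could look like the sketch below, which takes an explicit joint table over the selected subset and measures its distance from the product of its single-variable marginals. The function name `factorization_gap` is hypothetical, and the sketch assumes the joint table has already been assembled (for example by chain-rule queries to the model), which is only tractable for small subsets and vocabularies.

```python
import numpy as np

def factorization_gap(joint, eps=1e-12):
    """KL divergence (nats) and total-variation distance between a joint
    table over |S| variables and the product of its marginals.

    joint: array with one axis per variable in S, entries summing to 1.
    """
    joint = np.asarray(joint, dtype=float)
    n = joint.ndim
    product = np.ones_like(joint)
    for axis in range(n):
        other = tuple(a for a in range(n) if a != axis)
        marginal = joint.sum(axis=other)            # p(x_axis | context)
        shape = [1] * n
        shape[axis] = joint.shape[axis]
        product = product * marginal.reshape(shape) # broadcast into the product
    mask = joint > eps                              # treat 0 * log 0 as 0
    kl = float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))
    tv = 0.5 * float(np.sum(np.abs(joint - product)))
    return kl, tv
```

Both quantities are zero exactly when the joint factorizes, so reporting them per parallel step would show how much error the independence approximation introduces at a given threshold.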

Circularity Check

0 steps flagged

No significant circularity detected

The paper describes a supervised neural estimator trained to map hidden states of a pretrained masked diffusion model to pairwise conditional MI values, where the supervision labels are computed from the model's own conditional distributions. This constitutes a standard approximation task that learns an efficient surrogate from an auxiliary representation rather than re-deriving or tautologically copying the input labels. The subsequent use for MI-guided parallel decoding follows directly from deploying the trained network in a single forward pass and does not reduce to any self-definitional, fitted-input-renamed-as-prediction, or self-citation load-bearing step. No uniqueness theorems, ansatzes smuggled via citation, or renaming of known results are invoked. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that hidden states encode pairwise dependencies and that self-computed MI from the base model provides reliable supervision; no new physical entities are introduced and the only free parameters are those of the estimator network itself.

free parameters (1)

neural estimator weights
Parameters of the network that maps hidden states to MI matrix are learned from the supervision signal.

axioms (1)

domain assumption: Pairwise conditional mutual information can be computed exactly from the masked diffusion model's conditional distributions.
This supplies the ground-truth labels used to train the estimator.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

[1] Attention is all you need. Advances in Neural Information Processing Systems.

[2] Language models are few-shot learners. Advances in Neural Information Processing Systems.

[3] Masked Diffusion Models Are Fast Distribution Learners. arXiv preprint arXiv:2306.11363.

[4] Structured Denoising Diffusion Models in Discrete State-Spaces. Advances in Neural Information Processing Systems.

[5] Deep unsupervised learning using nonequilibrium thermodynamics. International Conference on Machine Learning, 2015.

[6] Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems.

[7] Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions. Advances in Neural Information Processing Systems.

[8] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019.

[9] Mask-Predict: Parallel Decoding of Conditional Masked Language Models. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019.

[10] MINE: mutual information neural estimation. International Conference on Machine Learning, 2018.

[11] On Variational Bounds of Mutual Information. International Conference on Machine Learning, 2019.

[12] CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information. International Conference on Machine Learning, 2020.

[13] A simple framework for contrastive learning of visual representations. International Conference on Machine Learning, 2020.

[14] XLNet: Generalized Autoregressive Pretraining for Language Understanding. Advances in Neural Information Processing Systems.

[15] Language models enable zero-shot prediction of the effects of mutations on protein function. Advances in Neural Information Processing Systems.

[16] Accelerated Sampling from Masked Diffusion Models via Entropy Bounded Unmasking. 2025, eprint.
