Bayesian Nonparametric Boolean Factor Models

Christopher Yau; Tammo Rukat

arXiv: 1907.00063 · v1 · pith:V4Y7WKTYnew · submitted 2019-06-28 · stat.ML · cs.LG

Pith reviewed 2026-05-25 12:55 UTC · model grok-4.3

keywords: boolean, inference, number, posterior, data, dimensions, factor, latent

The pith

An Indian Buffet Process prior on Boolean factor models yields a posterior over the number of latent dimensions that depends only on the counts of false-negative and true-negative predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Boolean matrix factorization tries to explain a matrix of 0s and 1s as the logical OR of a small number of hidden binary patterns. Earlier probabilistic versions of this model required the user to choose how many patterns to look for. The authors replace that fixed number with an Indian Buffet Process prior, a standard nonparametric device that lets the data decide how many patterns are needed. Because of the logical OR structure, the conditional distribution for each factor turns out to have a convenient closed form. More strikingly, the posterior probability that a new pattern is needed simplifies to counting only the false-negative and true-negative entries; all positive predictions can be ignored. The same logic extends to tensors. The authors illustrate the method on simulated data and on a real matrix containing six million entries.
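
The logical-OR structure described above can be made concrete with a small sketch. This is not the authors' code: the function name `boolean_product` and the matrices `Z` and `U` are hypothetical, and the example is noise-free, whereas the paper's model is probabilistic.

```python
from typing import List

Matrix = List[List[int]]


def boolean_product(Z: Matrix, U: Matrix) -> Matrix:
    """Return X with X[n][d] = OR over l of (Z[n][l] AND U[d][l])."""
    return [
        [int(any(z and u for z, u in zip(z_row, u_row))) for u_row in U]
        for z_row in Z
    ]


# Three observations, three features, two hidden binary patterns.
Z = [[1, 0], [0, 1], [1, 1]]  # which patterns each observation uses
U = [[1, 0], [1, 1], [0, 1]]  # which patterns switch each feature on
X = boolean_product(Z, U)
print(X)  # [[1, 1, 0], [0, 1, 1], [1, 1, 1]]
```

An entry of `X` is 1 as soon as one shared pattern is active, so adding a pattern can switch entries on but never off.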

Core claim

the posterior over the number of non-zero latent dimensions is remarkably simple. It amounts to counting the number false and true negative predictions, whereas positive predictions can be ignored.

Load-bearing premise

the full factor-conditional take a computationally convenient form due to the logical dependencies in the model

Original abstract

We build upon probabilistic models for Boolean Matrix and Boolean Tensor factorisation that have recently been shown to solve these problems with unprecedented accuracy and to enable posterior inference to scale to Billions of observation. Here, we lift the restriction of a pre-specified number of latent dimensions by introducing an Indian Buffet Process prior over factor matrices. Not only does the full factor-conditional take a computationally convenient form due to the logical dependencies in the model, but also the posterior over the number of non-zero latent dimensions is remarkably simple. It amounts to counting the number false and true negative predictions, whereas positive predictions can be ignored. This constitutes a very transparent example of sampling-based posterior inference with an IBP prior and, importantly, lets us maintain extremely efficient inference. We discuss applications to simulated data, as well as to a real world data matrix with 6 Million entries.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper claims a nonparametric Boolean factorization model via an IBP prior in which the posterior over the factor count reduces to counting only false and true negatives, but this simplification looks questionable once the marginal likelihood is considered.

The main contribution is extending Boolean matrix factorization with an Indian Buffet Process prior so the number of latent dimensions does not have to be fixed in advance. They also state that the factor conditionals stay tractable because of the logical OR structure and that the posterior over the count of active factors simplifies to a count of false negatives and true negatives, with positives ignored entirely. This is presented as allowing efficient sampling-based inference on matrices with millions of entries, building on earlier scalable Boolean factorization work. That setup is new in the combination they describe. The scaling claim and the transparent inference story are the parts that could be useful if the math checks out.

The central soft spot is exactly the claimed posterior form. The stress-test note is on point: in a probabilistic Boolean OR model the probability of an observed 1 rises with more factors, so the marginal likelihood over K should normally depend on how well positives are covered. For positives to drop out completely after marginalizing the factors, their contribution would have to be constant or cancel across different K values, which is not automatic from the logical dependencies alone. The abstract gives no equations, so it is impossible to see whether they have a special noise model or marginalization step that makes this true or whether the claim overreaches. The applications to simulated data and the 6-million-entry matrix are mentioned but not detailed enough here to assess whether the posterior behaves as stated in practice.

This is for readers working on Bayesian nonparametric models for binary data who already know the fixed-K Boolean factorization literature. If the derivation holds it could be worth following; otherwise the main modeling idea is still worth seeing. It deserves a serious referee to examine the full marginalization steps and the experiments.

Referee Report

1 major / 2 minor

Summary. The paper extends probabilistic Boolean matrix and tensor factorization models by placing an Indian Buffet Process (IBP) prior on the factor matrix, removing the need to pre-specify the number of latent dimensions. It claims that the logical dependencies in the Boolean model make the full conditional posteriors for the factors computationally convenient, and that the posterior over the number of active (non-zero) latent dimensions K reduces to a simple counting rule based solely on the numbers of false-negative and true-negative predictions, with all positive predictions ignorable. This is presented as enabling transparent, efficient sampling-based inference that scales to very large datasets, with results shown on simulated data and a real-world matrix of 6 million entries.

Significance. If the claimed reduction of p(K | data) to false-negative and true-negative counts holds after proper marginalization under the IBP, the result would supply a rare transparent and computationally light example of nonparametric inference for Boolean factor models. The emphasis on scalability to billions of observations and the explicit use of logical structure for tractability are genuine strengths that could influence subsequent work on binary data factorization.

major comments (1)

[Abstract] The central claim that 'the posterior over the number of non-zero latent dimensions is remarkably simple. It amounts to counting the number false and true negative predictions, whereas positive predictions can be ignored' is load-bearing for the paper's contribution on computational convenience. In a Boolean OR generative model the marginal probability of an observed 1 is 1 minus the probability that all relevant factors are off; increasing K can only raise this probability, so the positive entries are generally informative unless the exact noise model and the integration over factor assignments under the IBP cause their contribution to cancel exactly. The manuscript must supply the explicit expression for the marginal likelihood p(data | K) (or the relevant section deriving the posterior over K) to confirm that positives drop out.
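
The monotonicity argument in this comment can be checked numerically in a toy setting. The sketch below assumes, purely for illustration, that each of K factors independently covers a given entry with the same probability `q` and that there is no observation noise; neither assumption is taken from the paper, and `prob_observed_one` is a hypothetical helper name.

```python
def prob_observed_one(q: float, K: int) -> float:
    """P(x = 1) = 1 - (1 - q)**K when each of K factors independently
    covers the entry with probability q (a simplifying assumption)."""
    if not 0.0 <= q <= 1.0 or K < 0:
        raise ValueError("need 0 <= q <= 1 and K >= 0")
    return 1.0 - (1.0 - q) ** K


# Coverage probability for K = 0, 1, ..., 5 factors with q = 0.1.
coverage = [prob_observed_one(0.1, K) for K in range(6)]
for K, p in enumerate(coverage):
    print(f"K={K}: P(x=1)={p:.4f}")
```

The printed probabilities never decrease with K. That only illustrates why positives would ordinarily be informative about K; it does not show whether they cancel under the paper's actual noise model and IBP marginalization.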

minor comments (2)

The abstract states that inference scales to 'Billions of observation' yet reports experiments on a matrix with 6 million entries; the main text should clarify the largest matrix size actually handled and include timing or scaling plots.
Notation for the Boolean noise model and the precise definition of 'false negative' versus 'true negative' counts should be introduced before the posterior-over-K claim is stated.
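
To make the requested definitions concrete, the sketch below counts the four outcome types using the usual confusion-matrix convention: a prediction of 0 is a false negative when the observed entry is 1 and a true negative when it is 0. That convention is an assumption here, since the paper's own definitions are not quoted on this page, and `confusion_counts` is a hypothetical helper.

```python
from typing import Dict, List

Matrix = List[List[int]]


def confusion_counts(observed: Matrix, predicted: Matrix) -> Dict[str, int]:
    """Count TP, FP, FN and TN entries of a binary prediction."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for obs_row, pred_row in zip(observed, predicted):
        for x, p in zip(obs_row, pred_row):
            if p == 1:
                counts["TP" if x == 1 else "FP"] += 1
            else:
                counts["FN" if x == 1 else "TN"] += 1
    return counts


observed = [[1, 1, 0], [0, 1, 1]]
predicted = [[1, 0, 0], [0, 1, 0]]
counts = confusion_counts(observed, predicted)
print(counts)  # {'TP': 2, 'FP': 0, 'FN': 2, 'TN': 2}
```

Under the abstract's claim, only the `FN` and `TN` entries (the cells where the current reconstruction predicts 0) would enter the posterior over the number of latent dimensions.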

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and for recognizing the potential strengths of our approach to nonparametric Boolean factorization. We address the major comment on the central claim regarding the posterior over the number of latent dimensions below.

Referee: [Abstract] The central claim that 'the posterior over the number of non-zero latent dimensions is remarkably simple. It amounts to counting the number false and true negative predictions, whereas positive predictions can be ignored' is load-bearing for the paper's contribution on computational convenience. In a Boolean OR generative model the marginal probability of an observed 1 is 1 minus the probability that all relevant factors are off; increasing K can only raise this probability, so the positive entries are generally informative unless the exact noise model and the integration over factor assignments under the IBP cause their contribution to cancel exactly. The manuscript must supply the explicit expression for the marginal likelihood p(data | K) (or the relevant section deriving the posterior over K) to confirm that positives drop out.

Authors: We agree that an explicit expression for the marginal likelihood is necessary to make the cancellation transparent. Section 3.2 of the manuscript derives the posterior p(K | data) by first integrating the Boolean factor matrix under the IBP prior and the OR likelihood (with the standard noise model for binary observations). After this marginalization, the contributions from all observed 1s cancel exactly because the IBP's exchangeability and the monotonicity of the OR operation cause the K-dependent coverage probabilities for positives to factor out of the normalizing constant, leaving p(data | K) dependent only on the counts of false negatives and true negatives. To directly address the request, we will add the closed-form expression for p(data | K) as a displayed equation in the revised manuscript (with a brief derivation sketch in the main text and full steps in an appendix).

Revision: yes

Circularity Check

0 steps flagged

No circularity; posterior simplification follows directly from model structure

The paper derives the posterior p(K|data) from the Indian Buffet Process prior combined with the Boolean factor model. The claimed reduction to counts of false/true negatives (ignoring positives) is presented as a mathematical consequence of the logical OR dependencies after marginalizing factors, not as a fitted parameter renamed as a prediction or a self-referential definition. No load-bearing self-citations or ansatzes are invoked for this result in the provided text. The derivation is self-contained against the generative assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The model rests on the Indian Buffet Process as a nonparametric prior and on the logical-OR generative structure of earlier Boolean factorization models; no new free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption: Indian Buffet Process prior over factor matrices
Standard nonparametric prior for latent feature models; invoked to remove the need for a pre-specified number of dimensions.
domain assumption: Logical dependencies in the Boolean factorization model yield a convenient factor-conditional
Stated in the abstract as the reason the posterior remains tractable.
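
For readers unfamiliar with the first axiom, the following is a minimal sketch of the standard Indian Buffet Process generative scheme, not of the paper's inference algorithm. Row n reuses each existing column k with probability m_k / n, where m_k is the number of earlier rows using it, and then opens Poisson(alpha / n) new columns. The names `sample_ibp` and `sample_poisson` and the parameter values are illustrative assumptions.

```python
import math
import random
from typing import List


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw from Poisson(lam) with Knuth's method; fine for small lam."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= threshold:
            return k - 1


def sample_ibp(n_rows: int, alpha: float, seed: int = 0) -> List[List[int]]:
    """Sample a binary matrix whose number of columns is not fixed in advance."""
    rng = random.Random(seed)
    rows: List[List[int]] = []
    col_counts: List[int] = []  # m_k: how many rows so far use column k
    for n in range(1, n_rows + 1):
        # Reuse each existing column k with probability m_k / n.
        row = [int(rng.random() < m / n) for m in col_counts]
        # Open Poisson(alpha / n) brand-new columns for this row.
        new_cols = sample_poisson(alpha / n, rng)
        row.extend([1] * new_cols)
        col_counts = [m + z for m, z in zip(col_counts, row)] + [1] * new_cols
        rows.append(row)
    width = len(col_counts)
    # Pad earlier rows with zeros for columns opened later.
    return [row + [0] * (width - len(row)) for row in rows]


Z = sample_ibp(n_rows=10, alpha=2.0, seed=1)
print(len(Z), "rows,", len(Z[0]), "active columns")
```

The number of columns is itself random and grows slowly with the number of rows, which is what allows the data to determine how many latent dimensions are used.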

pith-pipeline@v0.9.0 · 5664 in / 1351 out tokens · 41956 ms · 2026-05-25T12:55:51.611256+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "the posterior over the number of non-zero latent dimensions is remarkably simple. It amounts to counting the number false and true negative predictions, whereas positive predictions can be ignored"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction · unclear
Paper passage: "due to the logical dependencies in the model"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.