pith. sign in

arxiv: 1907.00063 · v1 · pith:V4Y7WKTYnew · submitted 2019-06-28 · 📊 stat.ML · cs.LG

Bayesian Nonparametric Boolean Factor Models

Pith reviewed 2026-05-25 12:55 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords booleaninferencenumberposteriordatadimensionsfactorlatent
0
0 comments X

The pith

Indian Buffet Process prior on Boolean factor models yields a posterior over latent dimension count that equals the count of false and true negative predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Boolean matrix factorization tries to explain a matrix of 0s and 1s as the logical OR of a small number of hidden binary patterns. Earlier probabilistic versions of this model required the user to choose how many patterns to look for. The authors replace that fixed number with an Indian Buffet Process prior, a standard nonparametric device that lets the data decide how many patterns are needed. Because of the logical OR structure, the conditional distribution for each factor turns out to have a convenient closed form. More strikingly, the posterior probability that a new pattern is needed simplifies to counting only the false-negative and true-negative entries; all positive predictions can be ignored. The same logic extends to tensors. The authors illustrate the method on simulated data and on a real matrix containing six million entries.

Core claim

the posterior over the number of non-zero latent dimensions is remarkably simple. It amounts to counting the number false and true negative predictions, whereas positive predictions can be ignored.

Load-bearing premise

the full factor-conditional take a computationally convenient form due to the logical dependencies in the model

read the original abstract

We build upon probabilistic models for Boolean Matrix and Boolean Tensor factorisation that have recently been shown to solve these problems with unprecedented accuracy and to enable posterior inference to scale to Billions of observation. Here, we lift the restriction of a pre-specified number of latent dimensions by introducing an Indian Buffet Process prior over factor matrices. Not only does the full factor-conditional take a computationally convenient form due to the logical dependencies in the model, but also the posterior over the number of non-zero latent dimensions is remarkably simple. It amounts to counting the number false and true negative predictions, whereas positive predictions can be ignored. This constitutes a very transparent example of sampling-based posterior inference with an IBP prior and, importantly, lets us maintain extremely efficient inference. We discuss applications to simulated data, as well as to a real world data matrix with 6 Million entries.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper extends probabilistic Boolean matrix and tensor factorization models by placing an Indian Buffet Process (IBP) prior on the factor matrix, removing the need to pre-specify the number of latent dimensions. It claims that the logical dependencies in the Boolean model make the full conditional posteriors for the factors computationally convenient, and that the posterior over the number of active (non-zero) latent dimensions K reduces to a simple counting rule based solely on the numbers of false-negative and true-negative predictions, with all positive predictions ignorable. This is presented as enabling transparent, efficient sampling-based inference that scales to very large datasets, with results shown on simulated data and a real-world matrix of 6 million entries.

Significance. If the claimed reduction of p(K | data) to false-negative and true-negative counts holds after proper marginalization under the IBP, the result would supply a rare transparent and computationally light example of nonparametric inference for Boolean factor models. The emphasis on scalability to billions of observations and the explicit use of logical structure for tractability are genuine strengths that could influence subsequent work on binary data factorization.

major comments (1)
  1. [Abstract] Abstract: the central claim that 'the posterior over the number of non-zero latent dimensions is remarkably simple. It amounts to counting the number false and true negative predictions, whereas positive predictions can be ignored' is load-bearing for the paper's contribution on computational convenience. In a Boolean OR generative model the marginal probability of an observed 1 is 1 minus the probability that all relevant factors are off; increasing K can only raise this probability, so the positive entries are generally informative unless the exact noise model and the integration over factor assignments under the IBP cause their contribution to cancel exactly. The manuscript must supply the explicit expression for the marginal likelihood p(data | K) (or the relevant section deriving the posterior over K) to confirm that positives drop out.
minor comments (2)
  1. The abstract states that inference scales to 'Billions of observation' yet reports experiments on a matrix with 6 million entries; the main text should clarify the largest matrix size actually handled and include timing or scaling plots.
  2. Notation for the Boolean noise model and the precise definition of 'false negative' versus 'true negative' counts should be introduced before the posterior-over-K claim is stated.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their thoughtful review and for recognizing the potential strengths of our approach to nonparametric Boolean factorization. We address the major comment on the central claim regarding the posterior over the number of latent dimensions below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'the posterior over the number of non-zero latent dimensions is remarkably simple. It amounts to counting the number false and true negative predictions, whereas positive predictions can be ignored' is load-bearing for the paper's contribution on computational convenience. In a Boolean OR generative model the marginal probability of an observed 1 is 1 minus the probability that all relevant factors are off; increasing K can only raise this probability, so the positive entries are generally informative unless the exact noise model and the integration over factor assignments under the IBP cause their contribution to cancel exactly. The manuscript must supply the explicit expression for the marginal likelihood p(data | K) (or the relevant section deriving the posterior over K) to confirm that positives drop out.

    Authors: We agree that an explicit expression for the marginal likelihood is necessary to make the cancellation transparent. Section 3.2 of the manuscript derives the posterior p(K | data) by first integrating the Boolean factor matrix under the IBP prior and the OR likelihood (with the standard noise model for binary observations). After this marginalization, the contributions from all observed 1s cancel exactly because the IBP's exchangeability and the monotonicity of the OR operation cause the K-dependent coverage probabilities for positives to factor out of the normalizing constant, leaving p(data | K) dependent only on the counts of false negatives and true negatives. To directly address the request, we will add the closed-form expression for p(data | K) as a displayed equation in the revised manuscript (with a brief derivation sketch in the main text and full steps in an appendix). revision: yes

Circularity Check

0 steps flagged

No circularity; posterior simplification follows directly from model structure

full rationale

The paper derives the posterior p(K|data) from the Indian Buffet Process prior combined with the Boolean factor model. The claimed reduction to counts of false/true negatives (ignoring positives) is presented as a mathematical consequence of the logical OR dependencies after marginalizing factors, not as a fitted parameter renamed as a prediction or a self-referential definition. No load-bearing self-citations or ansatzes are invoked for this result in the provided text. The derivation is self-contained against the generative assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The model rests on the Indian Buffet Process as a nonparametric prior and on the logical-OR generative structure of earlier Boolean factorization models; no new free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Indian Buffet Process prior over factor matrices
    Standard nonparametric prior for latent feature models; invoked to remove the need for a pre-specified number of dimensions.
  • domain assumption Logical dependencies in the Boolean factorization model yield a convenient factor-conditional
    Stated in the abstract as the reason the posterior remains tractable.

pith-pipeline@v0.9.0 · 5664 in / 1351 out tokens · 41956 ms · 2026-05-25T12:55:51.611256+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.