Binary Stochastic Representations for Large Multi-class Classification

Aur\'elia L\'eon; Ludovic Denoyer; Nicolas Baskiotis; Thomas Gerald

arxiv: 1906.09838 · v1 · pith:AOQS6RL5new · submitted 2019-06-24 · 💻 cs.LG · stat.ML

Binary Stochastic Representations for Large Multi-class Classification

Thomas Gerald , Aur\'elia L\'eon , Nicolas Baskiotis , Ludovic Denoyer This is my paper

Pith reviewed 2026-05-25 17:27 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords large-scale classificationbinary codesend-to-end learningmulti-class classificationstochastic representationssublinear inference

0 comments

The pith

An end-to-end model learns binary codes for both categories and inputs at once to keep inference sublinear in large multi-class problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large multi-class classification suffers from linear inference cost when using one-vs-all methods. Earlier binary-code approaches achieve sublinear cost but require separate heuristics to assign codes to classes before training begins. The paper introduces Deep Stochastic Neural Codes, which optimizes code-to-class associations and input-to-code mappings together inside a single differentiable training loop. No external code-assignment step is needed, yet the model still decodes via Hamming distance or similar sublinear lookups. Experiments across datasets indicate the approach matches baseline accuracy while avoiding the pre-tuning step.

Core claim

The DSNC model keeps sublinear inference complexity while learning to associate binary codes with categories and to map inputs to binary codes simultaneously in an end-to-end fashion, without any a priori tuning, and experimental results show effectiveness w.r.t. baseline methods.

What carries the argument

Deep Stochastic Neural Codes (DSNC), which uses stochastic binary representations to jointly optimize code assignments and input mappings inside one training process.

If this is right

Inference remains sublinear in the number of classes at test time.
Code-to-category assignments emerge from gradient descent rather than separate heuristic search.
The same network produces both the binary code predictions and the class decisions.
No dataset-specific tuning of code lengths or assignment strategies is required before training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The joint optimization may reduce sensitivity to the initial choice of code length compared with two-stage methods.
Because codes are discovered rather than prescribed, the approach could adapt when new classes are added after initial training.
Stochastic sampling during training may act as implicit regularization that improves generalization on imbalanced large-class datasets.

Load-bearing premise

That jointly optimizing binary code assignments and input mappings via stochastic representations will produce codes that are both discriminative and efficiently decodable without the need for the complex pre-assignment heuristics used in prior binary code methods.

What would settle it

An experiment in which the learned codes require linear-time search for accurate decoding or yield lower accuracy than a heuristic-pre-assigned baseline would falsify the central claim.

Figures

Figures reproduced from arXiv: 1906.09838 by Aur\'elia L\'eon, Ludovic Denoyer, Nicolas Baskiotis, Thomas Gerald.

**Figure 1.** Figure 1: The DSNC model, with φ the probability distribution processed from input, b the binary code drawn from the distribution and dθ the decoding function from codes to classes. Given a code, we propose two different decoding methods to infer the corresponding class. The first one consists in using a function trained during the learning phase to compute the probability of each category for a given code: we wil… view at source ↗

**Figure 2.** Figure 2: Complexity and Accuracy trade-off on DMOZ-12K [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

read the original abstract

Classification with a large number of classes is a key problem in machine learning and corresponds to many real-world applications like tagging of images or textual documents in social networks. If one-vs-all methods usually reach top performance in this context, these approaches suffer from a high inference complexity, linear w.r.t the number of categories. Different models based on the notion of binary codes have been proposed to overcome this limitation, achieving in a sublinear inference complexity. But they a priori need to decide which binary code to associate to which category before learning using more or less complex heuristics. We propose a new end-to-end model which aims at simultaneously learning to associate binary codes with categories, but also learning to map inputs to binary codes. This approach called Deep Stochastic Neural Codes (DSNC) keeps the sublinear inference complexity but do not need any a priori tuning. Experimental results on different datasets show the effectiveness of the approach w.r.t baseline methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

End-to-end joint code learning is the actual novelty, but nothing enforces unique codes so the sublinear inference claim is not guaranteed.

read the letter

The main thing to know is that DSNC tries to learn class-to-binary-code assignments and input-to-code mappings together in one model, skipping the separate heuristic step that earlier binary code methods needed for large multi-class problems. That joint formulation is what sets it apart from the referenced prior work. It does a reasonable job stating the practical issue: one-vs-all classifiers scale poorly at inference when the number of classes reaches thousands, as in tagging or recommendation, and binary codes are a known route to sublinear lookup. Framing the whole thing as end-to-end and stochastic is a clean way to make the code assignment differentiable. The claim that this removes a priori tuning is fair on its face. The soft spot is the one the stress-test note flags. Nothing in the joint objective appears to penalize code collisions or enforce injectivity between classes and codes. If two classes converge to the same or nearby codes, predicting an L-bit vector no longer gives an unambiguous fast lookup; the decoder would need nearest-neighbor search or exhaustive checks, restoring linear cost. The abstract supplies no uniqueness regularizer, no post-training deduplication, and no analysis of final code diversity. Experimental details are also missing from the summary, so it is impossible to judge whether the reported gains actually survive that issue or rest on particular datasets and baselines. This paper is aimed at people who build production classifiers with large output spaces and care about inference latency inside existing supervised pipelines. A reader working on tagging or recsys efficiency would get value from seeing whether the joint training produces usable codes in practice. It deserves peer review because the problem is real and the end-to-end angle is a direct response to a documented limitation in prior methods, even though the complexity guarantee needs verification.

Referee Report

2 major / 2 minor

Summary. The paper proposes Deep Stochastic Neural Codes (DSNC), an end-to-end trainable model for large multi-class classification that jointly learns binary code assignments to categories and input-to-code mappings via stochastic representations. The approach is claimed to retain sublinear inference complexity (via code lookup) while avoiding the need for a priori code-to-class heuristics used in prior binary coding methods. Experiments on multiple datasets are reported to demonstrate effectiveness relative to baselines.

Significance. If the central claim holds, the work would offer a practical route to sublinear-time inference for problems with thousands of classes (e.g., tagging) without manual code design. The end-to-end joint optimization is a clear conceptual advance over two-stage methods, but its value depends on whether the learned codes remain sufficiently unique and decodable.

major comments (2)

[Abstract and §3] Abstract and §3 (model description): the sublinear-inference claim rests on the assumption that the learned code-to-class mapping is injective (or near-injective). The joint objective (cross-entropy on stochastic bits plus any auxiliary loss) contains no explicit uniqueness regularizer, collision penalty, or post-training deduplication step; nothing prevents two classes from converging to identical or Hamming-near codes, which would force the decoder back to linear or approximate-NN search.
[§4] §4 (experiments): no tables or text report the number of observed code collisions, the effective decoder complexity after training, or any ablation that measures inference time versus number of classes when collisions occur. Without these quantities the experimental superiority cannot be linked to the sublinear claim.

minor comments (2)

[Abstract] Abstract: omits all details on network architectures, loss functions, datasets, number of classes, and statistical significance testing, which are required to evaluate the experimental claims.
[§3] Notation: the stochastic bit representation and the precise form of the end-to-end objective are introduced without an equation number or explicit definition of the decoder lookup procedure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the sublinear-inference claim and its experimental support. We respond to each major comment below.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (model description): the sublinear-inference claim rests on the assumption that the learned code-to-class mapping is injective (or near-injective). The joint objective (cross-entropy on stochastic bits plus any auxiliary loss) contains no explicit uniqueness regularizer, collision penalty, or post-training deduplication step; nothing prevents two classes from converging to identical or Hamming-near codes, which would force the decoder back to linear or approximate-NN search.

Authors: The referee is correct that no explicit uniqueness term appears in the stated objective. At the same time, because code assignments and the input-to-code mapping are optimized jointly for classification accuracy, any collision between two class codes creates an irreducible source of confusion: inputs from either class would be mapped to the same stochastic representation, directly raising the cross-entropy loss. Consequently, any high-accuracy solution found by the optimizer must employ distinct codes. We will revise the model-description section to articulate this implicit guarantee and add a short paragraph on observed code uniqueness. revision: partial
Referee: [§4] §4 (experiments): no tables or text report the number of observed code collisions, the effective decoder complexity after training, or any ablation that measures inference time versus number of classes when collisions occur. Without these quantities the experimental superiority cannot be linked to the sublinear claim.

Authors: We agree that explicit reporting of collisions and post-training decoder complexity would make the link to sublinear inference more transparent. In the experiments we performed, the learned codes were unique (zero collisions), consistent with the reported accuracy improvements. We will add these statistics and a brief complexity discussion to the revised experimental section. revision: yes

Circularity Check

0 steps flagged

No circularity: new architecture with independent experimental validation.

full rationale

The paper proposes DSNC as a novel end-to-end architecture that jointly learns class-to-code assignments and input-to-code mappings. No equations, derivations, or self-citations are shown that reduce the sublinear-inference claim or the reported effectiveness to a quantity fitted from the same model outputs. The central premise is presented as an architectural choice whose performance is assessed against external baselines on datasets, satisfying the condition for a self-contained result with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, loss terms, or training details; therefore no specific free parameters, domain axioms, or invented entities beyond standard neural network assumptions can be identified.

pith-pipeline@v0.9.0 · 5693 in / 1062 out tokens · 39779 ms · 2026-05-25T17:27:10.773380+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 4 internal anchors

[1]

LSHTC: A Benchmark for Large-Scale Text Classification

Partalas, I., Kosmopoulos, A., Baskiotis, N., Artieres, T., Paliouras, G., Gaussier, E., Androutsopoulos, I., Amini, M.r., Galinari, P.: Large Scale Hierarchical Text Classiﬁcation Challenge : A Benchmark for Large-Scale Text Classiﬁcation. arXiv:1503.08581v1 (2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015
[2]

International Journal of Computer Vision 115(3) (2015) 211–252

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115(3) (2015) 211–252

work page 2015
[3]

IEEE Transactions on Pattern Analysis and Machine Intelligence 36(3) (2014) 507–520

Akata, Z., Perronnin, F., Harchaoui, Z., Schmid, C.: Good Practice in Large-Scale Learning for Image Classiﬁcation. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(3) (2014) 507–520

work page 2014
[4]

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Bengio, Y., L´ eonard, N., Courville, A.C.: Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv:1308.3432 (2013)

work page internal anchor Pith review Pith/arXiv arXiv 2013
[5]

In: Advances in Neural Information Processing Systems 23

Bengio, S., Weston, J., Grangier, D.: Label embedding trees for large multi-class tasks. In: Advances in Neural Information Processing Systems 23. (2010) 163–171

work page 2010
[6]

In: Proc

Weston, J., Makadia, A., Yee, H.: Label partitioning for sublinear ranking. In: Proc. of the 30th International Conference on Machine Learning (ICML-13). Volume 28. (2013) 181–189

work page 2013
[7]

In: IEEE International Conference on Data Science and Advanced Analytics, DSAA

Puget, R., Baskiotis, N.: Hierarchical label partitioning for large scale classiﬁca- tion. In: IEEE International Conference on Data Science and Advanced Analytics, DSAA. (2015) 1–10

work page 2015
[8]

Dietterich, T.G., Bakiri, G.: Solving multiclass learning problems via error- correcting output codes. J. of Artiﬁcial Intelligence Research 2 (1995) 263–286

work page 1995
[9]

In: Proceedings of the Twenty-Third International Joint Conference on Artiﬁcial Intelligence

Zhong, G., Cheriet, M.: Adaptive error-correcting output codes. In: Proceedings of the Twenty-Third International Joint Conference on Artiﬁcial Intelligence. IJCAI ’13, AAAI Press (2013) 1932–1938

work page 2013
[10]

In: Machine Learning and Knowledge Discovery in Databases: ECML PKDD

Ciss´ e, M., Arti` eres, T., Gallinari, P.: Learning compact class codes for fast inference in large multi class classiﬁcation. In: Machine Learning and Knowledge Discovery in Databases: ECML PKDD. Springer Berlin Heidelberg (2012) 506–520

work page 2012
[11]

IEEE Transactions on Pattern Analysis and Machine Intelli- gence 36(6) (2014) 1107–1119

Norouzi, M., Punjani, A., Fleet, D.J.: Fast exact search in hamming space with multi-index hashing. IEEE Transactions on Pattern Analysis and Machine Intelli- gence 36(6) (2014) 1107–1119

work page 2014
[12]

In: Proceedings of the 25th International Conference on Very Large Data Bases

Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hash- ing. In: Proceedings of the 25th International Conference on Very Large Data Bases. VLDB ’99, Morgan Kaufmann Publishers Inc. (1999) 518–529

work page 1999
[13]

In: Advances in Neural Information Processing Systems 21

Weiss, Y., Torralba, A., Fergus, R.: Spectral hashing. In: Advances in Neural Information Processing Systems 21. (2009) 1753–1760

work page 2009
[14]

International Journal of Ap- proximate Reasoning 50(7) (July 2009) 969–978

Salakhutdinov, R., Hinton, G.: Semantic hashing. International Journal of Ap- proximate Reasoning 50(7) (July 2009) 969–978

work page 2009
[15]

arXiv:1504.03410 (2015)

Lai, H., Pan, Y., Liu, Y., Yan, S.: Simultaneous feature learning and hash coding with deep neural networks. arXiv:1504.03410 (2015)

work page arXiv 2015
[16]

In: Computer Vision – ECCV 2016: 14th European Conference, Proceed- ings, Part V

Do, T.T., Doan, A.D., Cheung, N.M.: Learning to hash with binary deep neural network. In: Computer Vision – ECCV 2016: 14th European Conference, Proceed- ings, Part V. Springer International Publishing, Cham (2016) 219–234

work page 2016
[17]

IEEE Trans

Wang, J., Zhang, T., Sebe, N., Shen, H.T., et al.: A survey on learning to hash. IEEE Trans. on Pattern Analysis and Machine Intelligence - to appear (2017)

work page 2017
[18]

In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Erin Liong, V., Lu, J., Wang, G., Moulin, P., Zhou, J.: Deep hashing for compact binary codes learning. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015)

work page 2015
[19]

Machine Learning 8(3) (1992) 229–256

Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3) (1992) 229–256

work page 1992
[20]

Pat- tern Recognition 46(12) (2013) 3412–3424

Galar, M., Fern´ andez, A., Barrenechea, E., Bustince, H., Herrera, F.: Dynamic classiﬁer selection for one-vs-one strategy: avoiding non-competent classiﬁers. Pat- tern Recognition 46(12) (2013) 3412–3424

work page 2013
[21]

Deep Residual Learning for Image Recognition

He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arXiv:1512.03385 (2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015
[22]

Adam: A Method for Stochastic Optimization

Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv:1412.6980 (2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014

[1] [1]

LSHTC: A Benchmark for Large-Scale Text Classification

Partalas, I., Kosmopoulos, A., Baskiotis, N., Artieres, T., Paliouras, G., Gaussier, E., Androutsopoulos, I., Amini, M.r., Galinari, P.: Large Scale Hierarchical Text Classiﬁcation Challenge : A Benchmark for Large-Scale Text Classiﬁcation. arXiv:1503.08581v1 (2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015

[2] [2]

International Journal of Computer Vision 115(3) (2015) 211–252

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115(3) (2015) 211–252

work page 2015

[3] [3]

IEEE Transactions on Pattern Analysis and Machine Intelligence 36(3) (2014) 507–520

Akata, Z., Perronnin, F., Harchaoui, Z., Schmid, C.: Good Practice in Large-Scale Learning for Image Classiﬁcation. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(3) (2014) 507–520

work page 2014

[4] [4]

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Bengio, Y., L´ eonard, N., Courville, A.C.: Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv:1308.3432 (2013)

work page internal anchor Pith review Pith/arXiv arXiv 2013

[5] [5]

In: Advances in Neural Information Processing Systems 23

Bengio, S., Weston, J., Grangier, D.: Label embedding trees for large multi-class tasks. In: Advances in Neural Information Processing Systems 23. (2010) 163–171

work page 2010

[6] [6]

In: Proc

Weston, J., Makadia, A., Yee, H.: Label partitioning for sublinear ranking. In: Proc. of the 30th International Conference on Machine Learning (ICML-13). Volume 28. (2013) 181–189

work page 2013

[7] [7]

In: IEEE International Conference on Data Science and Advanced Analytics, DSAA

Puget, R., Baskiotis, N.: Hierarchical label partitioning for large scale classiﬁca- tion. In: IEEE International Conference on Data Science and Advanced Analytics, DSAA. (2015) 1–10

work page 2015

[8] [8]

Dietterich, T.G., Bakiri, G.: Solving multiclass learning problems via error- correcting output codes. J. of Artiﬁcial Intelligence Research 2 (1995) 263–286

work page 1995

[9] [9]

In: Proceedings of the Twenty-Third International Joint Conference on Artiﬁcial Intelligence

Zhong, G., Cheriet, M.: Adaptive error-correcting output codes. In: Proceedings of the Twenty-Third International Joint Conference on Artiﬁcial Intelligence. IJCAI ’13, AAAI Press (2013) 1932–1938

work page 2013

[10] [10]

In: Machine Learning and Knowledge Discovery in Databases: ECML PKDD

Ciss´ e, M., Arti` eres, T., Gallinari, P.: Learning compact class codes for fast inference in large multi class classiﬁcation. In: Machine Learning and Knowledge Discovery in Databases: ECML PKDD. Springer Berlin Heidelberg (2012) 506–520

work page 2012

[11] [11]

IEEE Transactions on Pattern Analysis and Machine Intelli- gence 36(6) (2014) 1107–1119

Norouzi, M., Punjani, A., Fleet, D.J.: Fast exact search in hamming space with multi-index hashing. IEEE Transactions on Pattern Analysis and Machine Intelli- gence 36(6) (2014) 1107–1119

work page 2014

[12] [12]

In: Proceedings of the 25th International Conference on Very Large Data Bases

Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hash- ing. In: Proceedings of the 25th International Conference on Very Large Data Bases. VLDB ’99, Morgan Kaufmann Publishers Inc. (1999) 518–529

work page 1999

[13] [13]

In: Advances in Neural Information Processing Systems 21

Weiss, Y., Torralba, A., Fergus, R.: Spectral hashing. In: Advances in Neural Information Processing Systems 21. (2009) 1753–1760

work page 2009

[14] [14]

International Journal of Ap- proximate Reasoning 50(7) (July 2009) 969–978

Salakhutdinov, R., Hinton, G.: Semantic hashing. International Journal of Ap- proximate Reasoning 50(7) (July 2009) 969–978

work page 2009

[15] [15]

arXiv:1504.03410 (2015)

Lai, H., Pan, Y., Liu, Y., Yan, S.: Simultaneous feature learning and hash coding with deep neural networks. arXiv:1504.03410 (2015)

work page arXiv 2015

[16] [16]

In: Computer Vision – ECCV 2016: 14th European Conference, Proceed- ings, Part V

Do, T.T., Doan, A.D., Cheung, N.M.: Learning to hash with binary deep neural network. In: Computer Vision – ECCV 2016: 14th European Conference, Proceed- ings, Part V. Springer International Publishing, Cham (2016) 219–234

work page 2016

[17] [17]

IEEE Trans

Wang, J., Zhang, T., Sebe, N., Shen, H.T., et al.: A survey on learning to hash. IEEE Trans. on Pattern Analysis and Machine Intelligence - to appear (2017)

work page 2017

[18] [18]

In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Erin Liong, V., Lu, J., Wang, G., Moulin, P., Zhou, J.: Deep hashing for compact binary codes learning. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015)

work page 2015

[19] [19]

Machine Learning 8(3) (1992) 229–256

Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3) (1992) 229–256

work page 1992

[20] [20]

Pat- tern Recognition 46(12) (2013) 3412–3424

Galar, M., Fern´ andez, A., Barrenechea, E., Bustince, H., Herrera, F.: Dynamic classiﬁer selection for one-vs-one strategy: avoiding non-competent classiﬁers. Pat- tern Recognition 46(12) (2013) 3412–3424

work page 2013

[21] [21]

Deep Residual Learning for Image Recognition

He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arXiv:1512.03385 (2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015

[22] [22]

Adam: A Method for Stochastic Optimization

Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv:1412.6980 (2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014