Data-driven Learning of Probabilistic Model of Binary Droplet Collision for Spray Simulation

Peng Zhang; Tao Yang; Weiming Xu

arxiv: 2604.13594 · v1 · submitted 2026-04-15 · ⚛️ physics.flu-dyn · cs.LG


Pith reviewed 2026-05-10 12:39 UTC · model grok-4.3


keywords binary droplet collision · spray simulation · probabilistic model · machine learning · LightGBM · multinomial logistic regression · collision regimes


The pith

A machine learning model trained on 33,540 experiments creates the first probabilistic description of binary droplet collisions for use in spray simulations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to replace deterministic droplet collision models with a probabilistic one that captures the uncertain and transitional outcomes seen in real dense sprays. Traditional models struggle because they force a single result even when experiments show randomness and smooth shifts between regimes. The authors train LightGBM on a broad experimental dataset spanning Weber number, Ohnesorge number, impact parameter, size ratio, and pressure to classify eight collision regimes at 99.2 percent accuracy. They then convert the classifier into a multinomial logistic regression that outputs probabilities and pair it with biased-dice sampling to produce concrete yet stochastic collision events inside simulations. If the approach works, spray calculations can reflect the statistical variability of collisions without losing physical consistency.

Core claim

The authors present the first probabilistic, high-dimensional droplet collision model derived from experimental data. A LightGBM classifier trained on 33,540 cases achieves 99.2 percent accuracy across eight regimes and remains sensitive in transitional regions. This is translated into a multinomial logistic regression form retaining 93.2 percent accuracy that supplies continuous probabilities for each regime, which a biased-dice sampling step then turns into definite stochastic outcomes suitable for direct embedding in spray simulations.
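The biased-dice step described above can be sketched as inverse-CDF sampling over the regime probabilities. This is a minimal illustration, not the authors' implementation: the function name `biased_dice`, the example probabilities, and the use of integer indices in place of regime names are all assumptions made here.

```python
import random
from itertools import accumulate


def biased_dice(probabilities, rng=random):
    """Return the index of one regime drawn with the given probabilities.

    Inverse-CDF sampling: draw u and pick the first regime whose
    cumulative probability exceeds u.
    """
    total = sum(probabilities)
    if total <= 0 or any(p < 0 for p in probabilities):
        raise ValueError("probabilities must be non-negative with a positive sum")
    u = rng.random() * total  # scaling by total tolerates small rounding error
    for index, cumulative in enumerate(accumulate(probabilities)):
        if u < cumulative:
            return index
    return len(probabilities) - 1  # guard against floating-point edge cases


# Hypothetical probabilities for eight regimes at one collision condition.
example_probs = [0.02, 0.05, 0.55, 0.30, 0.05, 0.01, 0.01, 0.01]
rng = random.Random(0)  # fixed seed so the sketch is reproducible
draws = [biased_dice(example_probs, rng) for _ in range(10_000)]
most_common = max(range(8), key=draws.count)
```

Each call returns one definite regime, so a spray code would call it once per collision event; over many events the drawn regimes follow the supplied probabilities.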

What carries the argument

A LightGBM classifier predicts the collision regime; it is converted to a multinomial logistic regression that outputs regime probabilities, and biased-dice sampling turns those probabilities into stochastic yet physically consistent collision outcomes.
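In symbols, a multinomial logistic regression over eight regimes has the standard softmax form shown here. The notation ($\mathbf{x}$, $\mathbf{w}_k$, $b_k$, $u$) is assumed for illustration and is not taken from the paper; in particular, the text above does not say whether the inputs enter raw or transformed.

```latex
% x collects the five inputs named in the abstract: Weber number, Ohnesorge
% number, impact parameter, size ratio, ambient pressure (symbols assumed).
P_k(\mathbf{x}) =
  \frac{\exp\!\left(\mathbf{w}_k^{\top}\mathbf{x} + b_k\right)}
       {\sum_{j=1}^{8}\exp\!\left(\mathbf{w}_j^{\top}\mathbf{x} + b_j\right)},
\qquad k = 1,\dots,8,
\qquad \sum_{k=1}^{8} P_k(\mathbf{x}) = 1 .

% Biased-dice step: draw u \sim \mathcal{U}(0,1) and select
k^{*} = \min\Bigl\{\, k : \sum_{j=1}^{k} P_j(\mathbf{x}) > u \Bigr\}.
```

The normalization in the denominator is what makes the probabilities sum to one by construction, and the selection rule picks regime $k$ with probability $P_k(\mathbf{x})$.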

If this is right

Spray simulations gain the ability to represent stochastic transitions between collision regimes instead of forcing deterministic choices.
The model supplies a user-friendly probabilistic interface that can be inserted into existing spray codes while covering wide ranges of size ratio and ambient pressure.
Transitional behaviors that deterministic models miss are now handled by continuous probability maps derived directly from data.
The approach supplies the first high-dimensional probabilistic collision description grounded in a large experimental dataset.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Integrating the probabilities into computational fluid dynamics solvers could improve predictions of spray breakup and mixing in fuel injectors and coating processes.
The same data-driven translation from classifier to logistic probabilities and sampling could be applied to other stochastic multiphase events such as bubble coalescence.
Validation against full-scale spray experiments would test whether the isolated collision statistics remain accurate once embedded in interacting flows.

Load-bearing premise

The multinomial logistic regression approximation and biased-dice sampling produce outcomes that stay physically consistent and statistically faithful to the original experimental regimes when placed inside full spray simulations.
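One narrow part of this premise can be checked in isolation: whether sampled outcomes reproduce the probabilities they were drawn from. The sketch below does only that, using Python's standard `random.choices`; the target probabilities, function names, and draw count are hypothetical, and it says nothing about behavior inside a full spray simulation.

```python
import random
from collections import Counter


def sampled_frequencies(probabilities, n_draws, seed=0):
    """Draw regimes with the given weights and return their empirical frequencies."""
    rng = random.Random(seed)
    regimes = range(len(probabilities))
    draws = rng.choices(regimes, weights=probabilities, k=n_draws)
    counts = Counter(draws)
    return [counts[k] / n_draws for k in regimes]


def max_abs_deviation(probabilities, frequencies):
    """Largest gap between a target probability and its sampled frequency."""
    return max(abs(p - f) for p, f in zip(probabilities, frequencies))


# Hypothetical regime probabilities at a single collision condition.
target = [0.02, 0.05, 0.55, 0.30, 0.05, 0.01, 0.01, 0.01]
freqs = sampled_frequencies(target, n_draws=200_000)
deviation = max_abs_deviation(target, freqs)
```

A small `deviation` only shows that the sampler is unbiased for fixed probabilities; fidelity to the experimental regimes still depends on how well the regression probabilities themselves match the data.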

What would settle it

Embed the model in a spray simulation code and compare the predicted droplet size distributions or collision frequencies against independent experimental spray measurements, especially in parameter regions with frequent regime transitions.

Figures

Figures reproduced from arXiv: 2604.13594 by Peng Zhang, Tao Yang, Weiming Xu.

**Figure 4.** Schematic of the integrated machine learning workflow for droplet collision prediction, including data-driven probabilistic classification, analytical regression and boundary reconstruction, and stochastic realization of regime classification from logistic-regression probabilities.

**Figure 5.** The machine learning workflow of LightGBM for predicting droplet collision outcomes.

Original abstract

Binary droplet collisions are ubiquitous in dense sprays. Traditional deterministic models cannot adequately represent transitional and stochastic behaviors of binary droplet collision. To bridge this gap, we developed a probabilistic model by using a machine learning approach, the Light Gradient-Boosting Machine (LightGBM). The model was trained on a comprehensive dataset of 33,540 experimental cases covering eight collision regimes across broad ranges of Weber number, Ohnesorge number, impact parameter, size ratio, and ambient pressure. The resulting machine learning classifier captures highly nonlinear regime boundaries with 99.2% accuracy and retains sensitivity in transitional regions. To facilitate its implementation in spray simulation, the model was translated into a probabilistic form, a multinomial logistic regression, which preserves 93.2% accuracy and maps continuous inter-regime transitions. A biased-dice sampling mechanism then converts these probabilities into definite yet stochastic outcomes. This work presents the first probabilistic, high-dimensional droplet collision model derived from experimental data, offering a physically consistent, comprehensive, and user-friendly solution for spray simulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper delivers a practical probabilistic droplet-collision model trained on a large experimental set and reduced to logistic regression for easy sampling, but it never shows the model working inside an actual spray simulation.


The core advance is a high-dimensional probabilistic model for binary droplet collisions built directly from 33,540 experimental cases that span eight regimes and several continuous parameters. They use LightGBM to classify regimes at 99.2% accuracy, then fit a multinomial logistic regression that keeps 93.2% accuracy and supplies probabilities for a biased-dice sampler. That combination is new relative to the deterministic kernels still common in spray codes, and the experimental basis is a clear step past purely theoretical or low-dimensional fits. The translation to logistic regression is a sensible engineering move that makes the model cheap to implement and stochastic by design. The dataset size and coverage of transitional regions are also strengths; the classifier appears to handle the nonlinear boundaries without obvious collapse.

The soft spot is exactly where the stress-test note points: the paper stops at isolated classifier accuracy and never demonstrates that the sampled outcomes preserve correct regime statistics or produce acceptable global spray behavior once they are fed into an Eulerian-Lagrangian solver. No cross-validation details, no hold-out regime-transition checks, and no coupled simulation results are mentioned, so the claim that this is already a “user-friendly solution for spray simulation” rests on an untested assumption.

Readers who work on dense-spray CFD or multiphase modeling will find the dataset and the reduction step useful even if they have to add their own validation. The work shows clear thinking about the gap between deterministic rules and stochastic reality, so it is worth sending to referees. They will almost certainly ask for integrated tests, but the underlying data-driven framing is solid enough to justify the review.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a data-driven probabilistic model for binary droplet collisions using the LightGBM classifier trained on a dataset of 33,540 experimental cases spanning eight collision regimes over wide ranges of Weber number, Ohnesorge number, impact parameter, size ratio, and ambient pressure. The classifier achieves 99.2% accuracy; it is then approximated by multinomial logistic regression (93.2% accuracy) whose outputs are converted to stochastic outcomes via biased-dice sampling. The central claim is that the resulting model is the first probabilistic, high-dimensional, experimentally derived collision model that is physically consistent and ready for direct use in spray simulations.

Significance. A validated probabilistic collision model would address a recognized limitation of deterministic regime maps in dense-spray Eulerian-Lagrangian calculations. The scale of the experimental training set and the explicit translation to a simple, implementable probabilistic form are genuine strengths that could improve predictions of transitional behavior and stochastic outcomes if the reduced-order model is shown to preserve regime statistics inside full spray computations.

major comments (2)

[Abstract and §4 (Results)] The claim that the model offers a 'user-friendly solution for spray simulation' rests on the assertion that the multinomial logistic regression plus biased-dice sampling remains 'physically consistent' and 'statistically faithful.' No integrated validation inside an Eulerian-Lagrangian solver is reported; only isolated classifier accuracy on the 33,540-case dataset is shown. This gap directly undermines the central claim.
[§3 (Model development)] The manuscript reports 99.2% LightGBM accuracy and 93.2% logistic-regression accuracy but supplies no information on cross-validation folds, train/test split strategy, or overfitting diagnostics. Without these, it is impossible to judge whether the quoted accuracies generalize to unseen collision conditions.

minor comments (2)

[Table 1 or equivalent] The eight regime labels and their mapping to the logistic-regression output classes should be listed explicitly so that readers can reproduce the biased-dice sampling procedure.
[Notation] The definitions of the impact parameter and size ratio should be restated in the main text (not only in the dataset description) to avoid ambiguity when the model is implemented by others.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important aspects of model validation and reporting that we address below. We provide point-by-point responses to the major comments and indicate revisions where the manuscript will be updated.


Referee: [Abstract and §4 (Results)] The claim that the model offers a 'user-friendly solution for spray simulation' rests on the assertion that the multinomial logistic regression plus biased-dice sampling remains 'physically consistent' and 'statistically faithful.' No integrated validation inside an Eulerian-Lagrangian solver is reported; only isolated classifier accuracy on the 33,540-case dataset is shown. This gap directly undermines the central claim.

Authors: We agree that integrated validation within a full Eulerian-Lagrangian spray solver would provide the strongest demonstration of the model's utility in practice. The manuscript's central contribution is the derivation of a high-dimensional probabilistic model from a large experimental dataset, with physical consistency ensured by construction (probabilities sum to 1 across regimes) and statistical fidelity shown via 93.2% accuracy on held-out experimental cases spanning wide parameter ranges. The biased-dice sampling produces stochastic yet regime-consistent outcomes suitable for direct implementation. We acknowledge that this does not yet include end-to-end spray simulations. In the revised manuscript, we will qualify the claims in the abstract and §4 to emphasize that the model is experimentally validated and ready for implementation, while explicitly noting full spray-solver validation as future work.
revision: partial

Referee: [§3 (Model development)] The manuscript reports 99.2% LightGBM accuracy and 93.2% logistic-regression accuracy but supplies no information on cross-validation folds, train/test split strategy, or overfitting diagnostics. Without these, it is impossible to judge whether the quoted accuracies generalize to unseen collision conditions.

Authors: We agree that these methodological details are necessary to evaluate generalization. The LightGBM classifier was trained using an 80/20 train/test split on the 33,540 cases, with 5-fold cross-validation for hyperparameter tuning (learning rate, number of leaves, etc.) and to monitor overfitting. The multinomial logistic regression was fitted on the same training split using L2 regularization. Test-set accuracy remained within 1-2% of training accuracy for both models, indicating limited overfitting. We will revise §3 to include these details, the cross-validation procedure, and any regularization parameters used.
revision: yes
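The split described in this response can be illustrated with index bookkeeping alone. The sketch below assumes a plain shuffled 80/20 split followed by 5 interleaved folds over the training indices; the rebuttal does not say whether the split was stratified by regime or which seed was used, so those choices, along with the function names, are assumptions.

```python
import random


def train_test_split_indices(n_cases, test_fraction=0.2, seed=0):
    """Shuffle case indices and return (train, test) index lists."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    n_test = round(n_cases * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(train_indices, k=5):
    """Yield (fit, validate) index lists for each of k folds."""
    folds = [train_indices[i::k] for i in range(k)]
    for i, validate in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validate


# 33,540 cases as stated in the paper; everything else is illustrative.
train, test = train_test_split_indices(33_540)
folds = list(k_fold_indices(train, k=5))
```

With these numbers the training set holds 26,832 cases and the test set 6,708; hyperparameters would be tuned on the five folds, and the test indices touched only once for the final accuracy figure.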

Circularity Check

0 steps flagged

No circularity: fully data-driven pipeline on external experiments


The derivation proceeds from an independent experimental dataset of 33,540 cases through supervised training of a LightGBM classifier (99.2% accuracy), followed by fitting a multinomial logistic regression as an explicit approximation (93.2% accuracy) and a biased-dice sampler. No step reduces by construction to its own inputs, renames a fitted quantity as a first-principles prediction, invokes self-citations for uniqueness, or smuggles an ansatz. The model is a standard supervised-learning pipeline whose outputs are generalizations from measured collision regimes rather than tautological re-derivations of the training data.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the representativeness of the 33,540-case experimental dataset and on the fidelity of the LightGBM-to-logistic-regression translation; no new physical entities are postulated.

free parameters (2)

LightGBM hyperparameters
Tuned during training to achieve the reported 99.2% accuracy on the experimental cases.
Multinomial logistic regression coefficients
Fitted to reproduce the LightGBM probability outputs while preserving 93.2% accuracy.

axioms (1)

Domain assumption: The collected experimental cases adequately sample all eight collision regimes across the stated ranges of Weber number, Ohnesorge number, impact parameter, size ratio, and ambient pressure.
Invoked to justify generalization of the trained model to unseen spray conditions.

