Predictive Volatility of Machine Learning in Micro-Samples: A Regularised Assessment of Regional Poverty

A. H. Jamaluddin; A. T. R. Dani; N. I. Mahat; S. S. M. Fauzi; V. Ratnasari

arXiv: 2604.06278 · v4 · pith:YSJGLHDDnew · submitted 2026-04-07 · stat.ME · cs.CY · stat.AP


Pith reviewed 2026-05-21 10:39 UTC · model grok-4.3

keywords: poverty, regional analysis, regularization, machine learning, small samples, Indonesia, ICT skills, cross-validation

The pith

Regularised linear shrinkage models outperform complex machine learning when identifying poverty drivers in small, collinear regional datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper addresses the challenge of unstable results in poverty analysis when datasets are limited to a few dozen observations and variables are highly interrelated. It systematically compares standard linear regression, penalised shrinkage methods, Bayesian approaches, a spatial model, and machine learning ensembles on Indonesian provincial data. The evaluation uses leave-one-out cross-validation to test true predictive ability rather than fit to the observed sample. Simple regularised linear models prove more reliable at avoiding overfitting than complex ensembles, and they consistently point to ICT skills as the strongest stable signal associated with lower poverty rates. A reader would care because policy decisions based on fragile statistical patterns can misdirect resources in data-poor regions.

Core claim

In data-constrained regional analysis, parametrically regularised linear shrinkage provides a more reliable mathematical foundation for isolating structural development priorities, such as ICT skills, than either naive OLS or unconstrained machine learning. This is shown by the superior out-of-sample performance of Ridge, Elastic Net, and LASSO models over complex ensembles such as BART, which suffer severe overfitting, when all are assessed via strict leave-one-out cross-validation on the n=34 provincial observations.

What carries the argument

Parametrically regularised linear shrinkage estimators, which stabilise coefficient estimates under multicollinearity by penalising model complexity during fitting.
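
For reference, the estimators named here can be written in their textbook form. The notation is an assumption introduced for illustration, not taken from the paper: $X$ is the $n \times p$ standardised predictor matrix, $y$ the centred response, $\lambda \ge 0$ the penalty strength, and $\alpha \in [0,1]$ the mixing weight in the glmnet-style parametrisation ($\alpha = 1$ gives LASSO, $\alpha = 0$ gives Ridge). The last line uses the singular value decomposition $X = U D V^{\top}$ with singular values $d_j$ to show why the penalty stabilises estimates under collinearity.

```latex
\begin{aligned}
\hat{\beta}_{\text{EN}}
  &= \arg\min_{\beta}\;
     \frac{1}{2n}\lVert y - X\beta \rVert_2^2
     + \lambda\Bigl[\alpha\lVert\beta\rVert_1
     + \tfrac{1-\alpha}{2}\lVert\beta\rVert_2^2\Bigr], \\
\hat{\beta}_{\text{ridge}}
  &= \bigl(X^{\top}X + n\lambda I\bigr)^{-1} X^{\top} y
  \qquad (\alpha = 0), \\
\hat{\beta}_{\text{ridge}}
  &= \sum_{j=1}^{p} \frac{d_j}{d_j^{2} + n\lambda}\,
     v_j\, u_j^{\top} y .
\end{aligned}
```

Under near-collinearity some $d_j$ approach zero, so the OLS weight $1/d_j$ grows without bound, whereas the ridge weight $d_j/(d_j^{2} + n\lambda)$ stays bounded for any $\lambda > 0$.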

If this is right

Simple linear shrinkage models achieve better out-of-sample prediction than complex ensembles like BART in small regional samples.
ICT skills emerge as the most stable proxy for lower provincial poverty across all successful regularised models.
Unconstrained machine learning carries a high risk of severe overfitting when applied to datasets with n around 34 and high collinearity.
Regularised linear methods supply a stronger basis than naive OLS for identifying structural priorities in such constrained settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pattern of shrinkage models outperforming ensembles may appear in other small-sample regional studies of economic or social outcomes facing multicollinearity.
A direct test could apply the identical model-comparison framework to poverty or development indicators from provinces in neighbouring countries with comparable data sizes.
If ICT skills remain the dominant stable factor, targeted digital training programs could be examined as one concrete lever among the identified priorities.

Load-bearing premise

Leave-one-out cross-validation on the 34 provincial observations is assumed to reliably estimate true out-of-sample predictive performance despite high collinearity and without any external hold-out data or further robustness checks.
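
The evaluation this premise refers to can be illustrated with a minimal sketch of leave-one-out cross-validation for OLS and ridge. Everything in it is an assumption made for illustration: the data are synthetic (34 rows, eight collinear columns), the helper names `ridge_fit` and `loocv_rmse` are hypothetical, and the penalty values are arbitrary rather than the ones tuned in the paper.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge fit with an unpenalised intercept.

    Predictors are standardised with training-fold statistics only,
    so no information from the held-out row leaks into the fit.
    Returns a function that predicts for new rows.
    """
    mu, sd = X.mean(axis=0), X.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)
    Z = (X - mu) / sd
    y_bar = y.mean()
    p = Z.shape[1]
    if lam == 0:
        # lam = 0 is plain OLS; lstsq tolerates a near-singular design.
        beta = np.linalg.lstsq(Z, y - y_bar, rcond=None)[0]
    else:
        beta = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ (y - y_bar))
    return lambda X_new: y_bar + ((X_new - mu) / sd) @ beta

def loocv_rmse(X, y, lam):
    """Refit n times, each time predicting the single held-out row."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        predict = ridge_fit(X[keep], y[keep], lam)
        errors[i] = y[i] - predict(X[i:i + 1])[0]
    return float(np.sqrt(np.mean(errors ** 2)))

# Synthetic stand-in for a small collinear regional dataset (not the paper's data).
rng = np.random.default_rng(0)
n, p = 34, 8
latent = rng.normal(size=(n, 1))
X = latent + 0.1 * rng.normal(size=(n, p))   # columns share one latent factor
y = -1.0 * X[:, 0] + rng.normal(scale=0.5, size=n)

for lam in (0.0, 1.0, 10.0):
    print(f"lambda={lam:>5}: LOOCV RMSE = {loocv_rmse(X, y, lam):.3f}")
```

Each held-out error comes from a model that never saw that row, which is what makes the estimate out-of-sample; with only 34 rows, though, every fold shares 32 of its 33 training rows with every other fold, so the fold errors are strongly dependent.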

What would settle it

An independent test on data from additional provinces or a later time period in which the regularised linear models no longer show better predictive accuracy than the complex machine learning ensembles.

Figures

Figures reproduced from arXiv: 2604.06278 by A. H. Jamaluddin, A. T. R. Dani, N. I. Mahat, S. S. M. Fauzi, V. Ratnasari.

**Figure 2.** Spatial distribution of provincial poverty in Indonesia ( …

**Figure 3.** Posterior means and 95% credible intervals for the Bayesian Horseshoe model

**Figure 4.** Posterior predictive check (PPC) for the Bayesian Horseshoe model (M8). The …

**Figure 5.** Sensitivity analysis demonstrating the impact of prior variance on the posterior …

**Figure 6.** Relative variable importance from Bayesian Additive Regression Trees (BART)

**Figure 7.** SHAP (SHapley Additive exPlanations) summary beeswarm plot for the XGBoost …

**Figure 8.** Leave-One-Out RMSE across all evaluated model frameworks. Simple linear …

Original abstract

Small regional datasets pose a dual statistical problem: correlated predictors inflate estimation variance, while flexible learners can become unstable because the available information per adaptive degree of freedom is limited. We examine this issue through predictive volatility, defined as the cross-sample dispersion and upper-tail behaviour of out-of-sample loss. Using simulation evidence reported for sparse linear, near-linear and heavy-tailed settings, we compare ordinary least squares, frequentist penalties, Bayesian shrinkage models, bounded-response and spatial specifications, and flexible machine-learning procedures. In the reported simulation results, regularised linear estimators generally dominate in the linear high-collinearity micro-sample settings and remain the most reliable overall, whereas tree-based methods become more competitive only when the signal is weakly nonlinear and the sample size is larger. In the empirical application to 34 Indonesian provinces, ridge yields the best leave-one-out performance, followed by elastic net and lasso. Across the Bayesian shrinkage specifications, ICT skills show the most consistent negative association with poverty, with the strongest support under horseshoe and spike-and-slab formulations. These results suggest that, in micro-sample regional modelling, the main constraint is limited information per effective degree of freedom rather than insufficient algorithmic flexibility.
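
The abstract defines predictive volatility as the cross-sample dispersion and upper-tail behaviour of out-of-sample loss. One plausible way to summarise that, sketched below, is the standard deviation and a high quantile of the loss across replicated samples. The function name `predictive_volatility`, the 95% tail level, and the toy losses are assumptions for illustration; the paper's exact operationalisation is not given here.

```python
import numpy as np

def predictive_volatility(losses, tail=0.95):
    """Summarise out-of-sample losses collected across replicated samples.

    dispersion : sample standard deviation of the losses
    upper_tail : the `tail` quantile of the losses (linear interpolation)
    """
    losses = np.asarray(losses, dtype=float)
    return {
        "mean": float(losses.mean()),
        "dispersion": float(losses.std(ddof=1)),
        "upper_tail": float(np.quantile(losses, tail)),
    }

# Toy losses, one per replicated sample (illustrative values only).
summary = predictive_volatility([1.0, 2.0, 3.0, 4.0, 5.0])
print(summary)
```

Two procedures with the same mean loss can differ sharply on the last two entries, which is the distinction the definition is meant to capture.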

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Regularized linear models outperform ensembles on this n=34 Indonesian poverty dataset, but the ICT stability claim looks sensitive to collinearity and single-point omissions.

The main thing to know is that this paper finds regularized linear models outperform machine learning ensembles for predicting regional poverty in a small sample of 34 Indonesian provinces, with ICT skills emerging as a stable factor in the better-performing models. They compare standard OLS against penalized regressions like LASSO, Ridge, and Elastic Net, plus Bayesian shrinkage, a spatial ICAR model, and ensembles such as BART. All are assessed with leave-one-out cross-validation to handle the tiny sample size.

This setup is sensible because LOOCV avoids wasting data in small-n problems, and the warning about overfitting in complex models is on point. Adding the spatial component shows some thought about the regional structure. On the positive side, the empirical comparison is clear and the conclusion that simplicity helps in data-constrained settings is fair. It applies these tools to a real policy issue in developing countries.

The soft spots are around the stability of the ICT finding. With high collinearity likely present in regional covariates and only 34 observations, LOOCV results can be sensitive to individual data points. The paper doesn't appear to include additional robustness tests like multiple random splits or simulations to verify if the variable rankings hold up. Without those, the consistency across regularized models might not generalize beyond this sample. The abstract also skips specifics on variable coding and collinearity handling, which leaves some questions about implementation.

This paper is for researchers focused on small-sample analysis in development economics or regional statistics. It doesn't break new theoretical ground but provides a practical example of method comparison. I'd recommend sending it for peer review. Referees can help strengthen the robustness section and clarify the details.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates statistical and machine learning approaches for identifying drivers of provincial poverty in Indonesia using a small sample (n=34) with high collinearity. It compares OLS, regularized linear models (Ridge, LASSO, Elastic Net), Bayesian shrinkage, a spatial ICAR model, and ensembles such as BART, assessing them via leave-one-out cross-validation (LOOCV). The central claims are that regularized linear shrinkage yields superior out-of-sample predictive performance and that ICT skills consistently rank as the most stable proxy for lower poverty across successful models, providing a more reliable basis for isolating structural priorities than naive OLS or unconstrained ML.

Significance. If the results hold, the work supplies useful evidence on the risks of complex ML in micro-samples and the advantages of parametric regularization for stable inference under collinearity, relevant to regional development statistics. Credit is due for the explicit model-comparison framework tailored to small n and for employing LOOCV rather than in-sample metrics.

major comments (2)

[Abstract and Evaluation Framework] The claim that ICT skills emerge as the most stable proxy across all successful regularised models is load-bearing for the primary contribution, yet no sensitivity checks (e.g., repeated random splits, bootstrap resampling of the n=34 observations, or simulation-based assessment of false-positive rates for variable selection) are reported to establish that this ranking is robust rather than an artifact of collinearity and single-point omissions in LOOCV.
[Methods] While LOOCV is a reasonable choice for n=34, the manuscript provides no details on how the regularization hyperparameter is selected in the presence of high collinearity (e.g., whether cross-validation paths were examined for stability or whether condition numbers/variance inflation factors were monitored), which directly affects the reliability of the reported coefficient stability for ICT.
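
The collinearity diagnostics the referee asks about can be computed in a few lines. The sketch below is illustrative only: it uses the standard identity that the variance inflation factor of predictor j equals the j-th diagonal entry of the inverse correlation matrix, the helper name `collinearity_diagnostics` is hypothetical, and the three synthetic columns stand in for real provincial covariates.

```python
import numpy as np

def collinearity_diagnostics(X):
    """Return per-column VIFs and the condition number of the standardised design."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    R = np.corrcoef(Z, rowvar=False)        # predictor correlation matrix
    vif = np.diag(np.linalg.inv(R))         # VIF_j = [R^{-1}]_{jj} = 1 / (1 - R_j^2)
    cond = np.linalg.cond(Z)                # ratio of largest to smallest singular value
    return vif, float(cond)

# Synthetic example: column b is nearly a copy of column a, column c is unrelated.
rng = np.random.default_rng(1)
a = rng.normal(size=34)
b = a + 0.2 * rng.normal(size=34)
c = rng.normal(size=34)
vif, cond = collinearity_diagnostics(np.column_stack([a, b, c]))
print("VIF:", np.round(vif, 2), "condition number:", round(cond, 2))
```

A VIF of 1 means a predictor is uncorrelated with the others; common rules of thumb treat values above roughly 5 to 10 as a sign that coefficient estimates are being inflated by collinearity.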

minor comments (2)

[Abstract] No information is given on variable coding, the exact grid or selection procedure for tuning parameters, or the quantitative criteria used to label models as 'successful'.
[Results] A summary table of LOOCV performance metrics (RMSE, MAE, or R²) across all compared models would improve clarity and allow readers to assess the magnitude of the reported superiority of regularized linear models.
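
A summary table of the kind requested in the last comment could be assembled from pooled held-out predictions as sketched below. The helper names `loocv_metrics` and `summary_table`, the model labels, and the toy numbers are all hypothetical; none of the values are results from the paper.

```python
import numpy as np

def loocv_metrics(y_true, y_pred):
    """RMSE, MAE and R^2 computed on pooled leave-one-out predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "RMSE": float(np.sqrt(np.mean(resid ** 2))),
        "MAE": float(np.mean(np.abs(resid))),
        "R2": float(1.0 - ss_res / ss_tot),
    }

def summary_table(y_true, predictions):
    """Format one row of metrics per model as plain text."""
    rows = [f"{'model':<10} {'RMSE':>6} {'MAE':>6} {'R2':>6}"]
    for name, y_pred in predictions.items():
        m = loocv_metrics(y_true, y_pred)
        rows.append(f"{name:<10} {m['RMSE']:>6.3f} {m['MAE']:>6.3f} {m['R2']:>6.3f}")
    return "\n".join(rows)

# Toy numbers only: three observations and two made-up sets of held-out predictions.
y_obs = [1.0, 2.0, 3.0]
toy_predictions = {"toy_a": [1.0, 2.0, 4.0], "toy_b": [2.0, 2.0, 2.0]}
metrics_a = loocv_metrics(y_obs, toy_predictions["toy_a"])
print(summary_table(y_obs, toy_predictions))
```

Note that a cross-validated R² computed this way can be negative when a model predicts worse than the sample mean, which is itself informative in an overfitting comparison.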

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the robustness and transparency of the analysis.

Referee: [Abstract and Evaluation Framework] The claim that ICT skills emerge as the most stable proxy across all successful regularised models is load-bearing for the primary contribution, yet no sensitivity checks (e.g., repeated random splits, bootstrap resampling of the n=34 observations, or simulation-based assessment of false-positive rates for variable selection) are reported to establish that this ranking is robust rather than an artifact of collinearity and single-point omissions in LOOCV.

Authors: We agree that the stability of the ICT skills ranking is central to our contribution and that additional sensitivity analyses would provide stronger evidence against potential artifacts from collinearity or LOOCV. While the consistency of this result across Ridge, LASSO, and Elastic Net already offers some reassurance, we will incorporate bootstrap resampling of the n=34 observations in the revised manuscript. This will allow us to report the proportion of bootstrap samples in which ICT skills ranks as the top or near-top predictor, directly addressing concerns about robustness.

Revision: yes

Referee: [Methods] While LOOCV is a reasonable choice for n=34, the manuscript provides no details on how the regularization hyperparameter is selected in the presence of high collinearity (e.g., whether cross-validation paths were examined for stability or whether condition numbers/variance inflation factors were monitored), which directly affects the reliability of the reported coefficient stability for ICT.

Authors: We thank the referee for highlighting this transparency gap. Hyperparameters for the regularized models were selected via the default LOOCV procedure in the glmnet implementation, but we did not report condition numbers, VIF values, or stability of the CV paths. In the revised Methods section we will add these details, including the selected lambda values, a brief description of the CV path behavior, and VIF diagnostics computed on the original predictor matrix to quantify the degree of collinearity.

Revision: yes
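
The bootstrap check proposed in the first response, reporting how often a predictor ranks first across resamples, might look like the sketch below. It is a simplified illustration under stated assumptions: ridge with a fixed penalty stands in for the tuned models, the data are synthetic with column 0 playing the role of a hypothetical ICT-skills indicator, and the helper names `ridge_coefs` and `top_rank_frequency` are invented for this example.

```python
import numpy as np

def ridge_coefs(X, y, lam):
    """Ridge coefficients on standardised predictors (intercept removed by centring)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ (y - y.mean()))

def top_rank_frequency(X, y, lam, n_boot=500, seed=0):
    """Share of bootstrap resamples in which each predictor has the largest |coefficient|."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample rows with replacement
        beta = ridge_coefs(X[idx], y[idx], lam)
        counts[np.argmax(np.abs(beta))] += 1
    return counts / n_boot

# Synthetic collinear data (not the paper's): all columns share one latent factor.
rng = np.random.default_rng(3)
n, p = 34, 5
latent = rng.normal(size=(n, 1))
X = 0.8 * latent + 0.6 * rng.normal(size=(n, p))
y = -1.0 * X[:, 0] + rng.normal(scale=0.5, size=n)

freq = top_rank_frequency(X, y, lam=1.0)
print("share of resamples ranked first, by column:", np.round(freq, 3))
```

A frequency near 1 for one column would indicate a ranking that survives resampling; frequencies spread across several correlated columns would suggest the ranking is driven by collinearity rather than a stable signal.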

Circularity Check

0 steps flagged

No circularity: performance claims rest on independent LOOCV evaluation

The paper evaluates regularised linear models against OLS and ML ensembles by computing predictive performance via strict Leave-One-Out Cross-Validation on the n=34 sample. This LOOCV procedure generates out-of-sample error estimates that are not algebraically equivalent to the fitted coefficients or hyperparameters themselves. The emergence of ICT skills as the most stable proxy is reported as an empirical outcome of the cross-validated coefficient paths rather than a definitional or self-referential step. No self-citation load-bearing arguments, ansatz smuggling, or renaming of known results appear in the derivation chain; the central comparison between shrinkage and complex models is therefore self-contained against the external benchmark of LOOCV.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis rests on standard domain assumptions about the validity of LOOCV for small collinear samples and the instability of OLS under those conditions; no new entities or free parameters are introduced beyond routine regularization tuning.

axioms (2)

domain assumption: High multidimensional collinearity and small sample size (n=34) render standard OLS unstable and misleading.
Invoked in the opening paragraph as the core statistical hazard motivating the model comparison.

domain assumption: LOOCV provides a reliable estimate of predictive performance for model selection in this setting.
Used as the strict evaluation criterion for all models.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "simple linear shrinkage models (Ridge, Elastic Net, LASSO) achieve the superior out-of-sample prediction... ICT skills consistently emerge as the most stable proxy for lower provincial poverty"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.