arxiv: 2605.08660 · v1 · submitted 2026-05-09 · 💻 cs.LG

Optimised Support Vector Regression for California Housing Price Prediction: The Critical Role of Feature Engineering and Hyperparameter Tuning

Pith reviewed 2026-05-12 01:36 UTC · model grok-4.3

classification 💻 cs.LG
keywords Support Vector Regression · California Housing · Feature Engineering · Hyperparameter Tuning · Ablation Study · R2 Score · Pipeline · Regression Benchmark

The pith

Support Vector Regression reaches 0.723 R-squared on California housing once scaled and tuned, narrowing the gap to tree-based models without closing it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether SVR's previously reported weak showing on the California Housing dataset stems from the algorithm itself or from how it was set up. By building ten domain-derived features from the eight raw inputs, placing scaling and feature selection inside a scikit-learn pipeline, and running randomized hyperparameter search with three-fold cross-validation, the authors reach a test R2 of 0.723, against the 0.60 previously reported for SVR. An ablation study attributes most of the gain to scaling alone, with smaller additional lifts from the new features and tuning. In a ten-model comparison the tuned SVR finishes fourth, ahead of linear and nearest-neighbor baselines but behind XGBoost, Random Forest and Gradient Boosting. Ten-fold cross-validation gives a mean R2 of 0.703 (95% CI: [0.630, 0.775]), which the authors take as evidence that the result generalizes.

Core claim

The central claim is that SVR's low prior performance on this benchmark reflected experimental choices rather than an inherent limit. A leakage-safe pipeline that first scales the data, adds ten derived features, selects the strongest ones, and tunes the SVR kernel and regularization parameters produces a test R2 of 0.723. Scaling accounts for the largest jump (+0.744), feature engineering adds +0.026, and hyperparameter search contributes the final +0.008. The same configuration places SVR fourth among ten common regressors while maintaining stable performance under repeated cross-validation.
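The increments can be re-derived from the stage-level scores quoted in the abstract (-0.054, 0.690, 0.716, 0.723). The sketch below is only arithmetic on those rounded figures, and the variable names are illustrative. The rounded stage scores give +0.007 for tuning rather than the reported +0.008; the difference presumably comes from rounding of the underlying values, which is an assumption since the unrounded numbers are not shown here.

```python
# Stage-level test R2 values as quoted in the abstract (rounded to 3 d.p.).
stages = [
    ("raw SVR", -0.054),
    ("+ scaling", 0.690),
    ("+ feature engineering", 0.716),
    ("+ hyperparameter tuning", 0.723),
]

# Increment contributed by each stage relative to the previous one.
increments = {
    name: round(score - prev_score, 3)
    for (_, prev_score), (name, score) in zip(stages, stages[1:])
}

# Gain over the previously reported SVR result (Preethi et al., R2 = 0.60).
prior_r2 = 0.60
absolute_gain = round(stages[-1][1] - prior_r2, 3)
relative_gain = absolute_gain / prior_r2

for name, delta in increments.items():
    print(f"{name}: {delta:+.3f}")
print(f"absolute gain: {absolute_gain:.3f}, relative gain: {relative_gain:.1%}")
```

The relative gain works out to about 20.5%, consistent with the "approximately 20%" stated in the abstract.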

What carries the argument

A four-stage ablation inside a scikit-learn Pipeline that isolates scaling, derived-feature construction, feature-importance filtering, and randomized hyperparameter search.
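A cumulative ablation of this kind might be organised as below. This is a minimal sketch under stated assumptions, not the authors' code: the data are synthetic, `add_ratio` is a hypothetical derived feature, the search ranges are illustrative, and the feature-importance filter is left out for brevity. What it does show is the structure, with each stage adding one component and every fitted step living inside a `Pipeline`.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (NOT the California Housing data): two columns on
# large raw scales, with the target driven by their ratio.
rng = np.random.default_rng(0)
rooms = rng.uniform(500.0, 5000.0, size=400)
households = rng.uniform(100.0, 1000.0, size=400)
X = np.column_stack([rooms, households])
y = rooms / households + rng.normal(0.0, 0.1, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

def add_ratio(X):
    """Hypothetical derived feature: column 0 divided by column 1."""
    return np.column_stack([X, X[:, 0] / X[:, 1]])

# Each stage adds one component on top of the previous stage.
stage_a = Pipeline([("svr", SVR())])
stage_b = Pipeline([("scale", StandardScaler()), ("svr", SVR())])
stage_c = Pipeline([
    ("derive", FunctionTransformer(add_ratio)),
    ("scale", StandardScaler()),
    ("svr", SVR()),
])
stage_d = RandomizedSearchCV(
    stage_c,
    param_distributions={  # illustrative ranges, not the paper's
        "svr__C": loguniform(1e-1, 1e2),
        "svr__epsilon": loguniform(1e-3, 1e0),
        "svr__gamma": loguniform(1e-3, 1e0),
    },
    n_iter=10,
    cv=3,  # three-fold search, as described in the abstract
    scoring="r2",
    random_state=0,
)

results = {}
for name, model in [("raw", stage_a), ("scaled", stage_b),
                    ("scaled+features", stage_c), ("tuned", stage_d)]:
    model.fit(X_train, y_train)
    results[name] = model.score(X_test, y_test)  # R2 on the held-out split

for name, r2 in results.items():
    print(f"{name}: R2 = {r2:.3f}")
```

Because the scaler is fitted inside the pipeline, the search refits it on the training portion of every fold, so no statistics from the validation fold or the test split reach the model.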

If this is right

  • SVR can be competitive on tabular regression once scaling is handled inside a single pipeline.
  • Domain-motivated feature construction supplies a modest but measurable extra lift beyond scaling.
  • Hyperparameter search yields small further gains after scaling and features are fixed.
  • Proper preprocessing turns a previously last-place algorithm into a fourth-place one among standard regressors.
  • Ten-fold cross-validation results stay within a usable confidence band, supporting use on similar housing-price tasks.
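One way to see why scaling matters so much, assuming an RBF kernel (which depends on the data only through the squared Euclidean distance between points), is to check which feature dominates that distance before and after standardisation. The rows below are hypothetical values chosen to mimic a scale mismatch between an income-like and a population-like column; they are not taken from the dataset.

```python
from statistics import mean, pstdev

# Hypothetical rows of (median income, population); illustrative values only.
rows = [(2.0, 800.0), (8.0, 1200.0), (5.0, 3000.0), (3.5, 500.0)]

def standardize(rows):
    """Z-score each column using the population standard deviation."""
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [
        tuple((value - mu) / sigma for value, (mu, sigma) in zip(row, stats))
        for row in rows
    ]

def population_share(a, b):
    """Fraction of the squared Euclidean distance due to the second column."""
    squared_diffs = [(x - y) ** 2 for x, y in zip(a, b)]
    return squared_diffs[1] / sum(squared_diffs)

raw_share = population_share(rows[0], rows[1])
scaled = standardize(rows)
scaled_share = population_share(scaled[0], scaled[1])

print(f"raw: {raw_share:.4f}, standardized: {scaled_share:.4f}")
```

On the raw values the population column accounts for nearly all of the distance between the first two rows, so the kernel is effectively blind to income. After standardisation the income difference carries most of the distance for this pair.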

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pipeline discipline could be applied to other kernel methods that are sensitive to scale and feature representation.
  • If scaling alone explains most of the recovery, practitioners might first test a simple scaled SVR before moving to more complex models.
  • The modest extra gain from engineered features suggests that similar domain-derived ratios or aggregates may help on other geographic or economic regression problems.
  • Future benchmarks that report only raw-algorithm scores without standardized pipelines risk understating the reachable performance of simpler models.

Load-bearing premise

The reported gains arise from the described preprocessing and tuning steps rather than from overfitting the particular train-test split or from unstated differences in how the baseline 0.60 result was obtained.

What would settle it

Re-running the identical pipeline and ablation on a fresh random 80-20 split of the California Housing data and obtaining an R2 below 0.65 would falsify the claim that the configuration reliably lifts performance.
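A check of this kind could be scripted along the following lines. This is a sketch on synthetic data with a plain scaled SVR standing in for the paper's full pipeline; `r2_across_splits` is a hypothetical helper, and the only figure taken from the text is the 0.65 threshold.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def r2_across_splits(model, X, y, seeds, test_size=0.2):
    """Refit `model` on a fresh random split per seed; return test R2 values."""
    scores = []
    for seed in seeds:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        model.fit(X_tr, y_tr)  # refits every pipeline step on this split only
        scores.append(r2_score(y_te, model.predict(X_te)))
    return scores

# Synthetic stand-in data; replace with the real features and target.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0.0, 0.1, size=300)

model = make_pipeline(StandardScaler(), SVR())
scores = r2_across_splits(model, X, y, seeds=range(5))

threshold = 0.65  # falsification threshold quoted in the text
print(f"min R2 = {min(scores):.3f}, below threshold: {min(scores) < threshold}")
```

Running several seeds rather than one reduces the chance that a single lucky or unlucky split decides the outcome.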

Original abstract

In the recent literature, Support Vector Regression (SVR) has been cited as one of the weakest performers on the California Housing benchmark dataset, with Preethi et al. (2025) specifically ranking it last among the algorithms they tested, reporting an R2 of only 0.60. This paper examines whether the previously reported performance reflects experimental configuration choices rather than an inherent algorithmic limitation. A structured experimental workflow is applied: ten domain-motivated derived features are constructed from the eight raw inputs, an exploratory ensemble feature importance analysis identifies the most predictive candidates, and a randomised search over hyperparameter combinations with three-fold cross-validation selects the optimal SVR configuration within a leakage-safe scikit-learn Pipeline. A formal four-stage ablation study isolates the contribution of each component: scaling alone accounts for +0.744 in R2 (from -0.054 to 0.690), feature engineering adds +0.026 (to 0.716), and hyperparameter tuning contributes +0.008 (to 0.723). The resulting tuned SVR achieves a test R2 of 0.723, a 0.123-point absolute improvement over the previously reported SVR result (from 0.60 to 0.723, approximately 20% relative gain). In the ten-model comparison, the tuned SVR ranks fourth with R2 = 0.723, below XGBoost (0.832), Random Forest (0.814) and Gradient Boosting (0.783), while substantially outperforming simpler baselines. Ten-fold cross-validation yields a mean R2 of 0.703 (95% CI: [0.630, 0.775]), confirming robust generalisation. The observed improvement from R2 = 0.60 to R2 = 0.723 is associated primarily with proper feature scaling within a unified preprocessing pipeline, with domain-motivated feature engineering and systematic hyperparameter tuning providing further incremental gains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that prior reports of weak SVR performance (R²=0.60) on the California Housing dataset stem from suboptimal configuration rather than algorithmic limits. By constructing ten domain-derived features, performing exploratory ensemble feature importance analysis, and applying randomized hyperparameter search with 3-fold CV inside a leakage-safe scikit-learn Pipeline, the authors report a final test R² of 0.723 (a 0.123 absolute gain). A four-stage ablation attributes +0.744 to scaling, +0.026 to feature engineering, and +0.008 to tuning; the tuned SVR ranks fourth among ten models, with 10-fold CV mean R²=0.703 confirming generalization.

Significance. If the pipeline is verifiably leakage-free, the work usefully demonstrates that benchmark performance gaps can be closed through standard preprocessing and tuning, with the explicit ablation study providing a clear decomposition of contributions. The emphasis on a unified Pipeline and cross-validation offers a reproducible template for similar regression tasks, though the incremental nature of the gains limits broader theoretical impact.

major comments (3)
  1. [Abstract / Experimental Workflow] Abstract and experimental workflow: the exploratory ensemble feature importance analysis that selects candidates from the ten derived features occurs before the 3-fold CV hyperparameter search. If this step is performed on the full dataset (standard unless explicitly nested), it leaks test-set information into the retained feature set, invalidating both the reported test R²=0.723 and the ablation increments (+0.026 from features, +0.008 from tuning). The claim of a 'leakage-safe Pipeline' does not automatically extend to this pre-CV selection; the 10-fold CV mean of 0.703 cannot correct for a fixed feature set chosen outside the folds.
  2. [Ablation Study] Ablation study description: the four-stage ablation isolates scaling (+0.744), feature engineering (+0.026), and tuning (+0.008) to reach 0.723, but does not specify whether feature selection or importance ranking is re-executed inside each ablation stage's CV loop. Without this nesting, the incremental contributions cannot be isolated from leakage artifacts.
  3. [Introduction / Results] Comparison to Preethi et al.: the central 0.123-point improvement over the cited R²=0.60 requires that the prior baseline used an identical train-test split, preprocessing, and evaluation protocol. The manuscript provides no reproduction details or side-by-side configuration table, leaving open the possibility that part of the gain arises from differences in the baseline setup rather than the proposed components.
minor comments (2)
  1. [Abstract] The exact definitions and formulas for the ten domain-motivated derived features are not listed in the abstract; including them (or a table) would improve reproducibility.
  2. [Results] The 95% CI [0.630, 0.775] on the 10-fold CV mean R²=0.703 is reported, but it is unclear whether this uses the final selected feature set or re-runs selection inside the outer CV.
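For context on the interval in minor comment 2: if it is a standard Student-t interval over the ten fold scores (an assumption, since the construction is not stated here), its width pins down the fold-to-fold spread. The sketch below computes such an interval and back-solves the implied standard deviation. The fold scores are hypothetical, chosen only so that their mean is 0.703.

```python
import math
from statistics import mean, stdev

T_CRIT_DF9 = 2.262  # two-sided 95% Student-t critical value, 9 degrees of freedom

def t_interval(fold_scores, t_crit=T_CRIT_DF9):
    """Mean and 95% t-interval for the mean of ten fold scores."""
    m = mean(fold_scores)
    half_width = t_crit * stdev(fold_scores) / math.sqrt(len(fold_scores))
    return m, m - half_width, m + half_width

# Hypothetical fold scores (NOT from the paper), constructed to average 0.703.
folds = [0.55, 0.62, 0.66, 0.70, 0.71, 0.72, 0.74, 0.76, 0.78, 0.79]
m, lo, hi = t_interval(folds)
print(f"mean = {m:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")

# Back-solve the fold-level standard deviation implied by the reported CI.
lower, upper, n_folds = 0.630, 0.775, 10
half_width = (upper - lower) / 2
implied_std = half_width * math.sqrt(n_folds) / T_CRIT_DF9
print(f"implied fold-level std = {implied_std:.3f}")
```

Under that assumption the reported interval implies a fold-level standard deviation of roughly 0.10, so individual folds would vary noticeably around the mean.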

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to strengthen the experimental description, ensure leakage safety, and improve reproducibility.

Point-by-point responses
  1. Referee: [Abstract / Experimental Workflow] Abstract and experimental workflow: the exploratory ensemble feature importance analysis that selects candidates from the ten derived features occurs before the 3-fold CV hyperparameter search. If this step is performed on the full dataset (standard unless explicitly nested), it leaks test-set information into the retained feature set, invalidating both the reported test R²=0.723 and the ablation increments (+0.026 from features, +0.008 from tuning). The claim of a 'leakage-safe Pipeline' does not automatically extend to this pre-CV selection; the 10-fold CV mean of 0.703 cannot correct for a fixed feature set chosen outside the folds.

    Authors: We appreciate the referee's identification of this potential leakage issue. The manuscript presents the ensemble feature importance analysis as an exploratory step prior to the hyperparameter search conducted inside the Pipeline. To address this rigorously, we will revise the experimental workflow to nest the feature importance analysis within each cross-validation fold, restricting it to training data only via a custom scikit-learn transformer. The hyperparameter search and final evaluation will then proceed on the nested-selected features. We will re-execute the experiments under this fully nested protocol and update the test R², ablation increments, and 10-fold CV results in the revised manuscript if they differ from the current values. revision: yes

  2. Referee: [Ablation Study] Ablation study description: the four-stage ablation isolates scaling (+0.744), feature engineering (+0.026), and tuning (+0.008) to reach 0.723, but does not specify whether feature selection or importance ranking is re-executed inside each ablation stage's CV loop. Without this nesting, the incremental contributions cannot be isolated from leakage artifacts.

    Authors: We agree that the ablation study requires explicit nesting of feature selection to validly attribute the incremental gains. In the revised manuscript we will expand the ablation section to describe a fully nested procedure: for each ablation stage the ensemble feature importance analysis is re-run inside the CV loop on training data only, followed by scaling, feature engineering, and tuning as appropriate to that stage. The reported contributions will be recomputed under this protocol and presented with the updated values. revision: yes

  3. Referee: [Introduction / Results] Comparison to Preethi et al.: the central 0.123-point improvement over the cited R²=0.60 requires that the prior baseline used an identical train-test split, preprocessing, and evaluation protocol. The manuscript provides no reproduction details or side-by-side configuration table, leaving open the possibility that part of the gain arises from differences in the baseline setup rather than the proposed components.

    Authors: We will add a side-by-side configuration table in the revised Results section that explicitly lists the train-test split, preprocessing steps, feature handling, and evaluation protocol used in our work alongside the details reported by Preethi et al. (2025). This will make any differences transparent and allow readers to evaluate how much of the observed gain is attributable to our pipeline components versus protocol variations. We will also state the exact split and random seed employed in our experiments. revision: yes
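The nesting promised in responses 1 and 2 could look roughly like this. The sketch swaps in scikit-learn's built-in `SelectFromModel` around a random forest in place of the custom transformer the authors mention, and it runs on synthetic data, so it illustrates the protocol rather than the authors' implementation. Because selection is a pipeline step, it is refitted on the training portion of each fold.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data: only the first two of eight columns are informative.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 8))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0.0, 0.1, size=300)

nested = Pipeline([
    # Importance-based selection, refitted per fold on training data only.
    ("select", SelectFromModel(
        RandomForestRegressor(n_estimators=50, random_state=0),
        threshold="median",  # keep features at or above the median importance
    )),
    ("scale", StandardScaler()),
    ("svr", SVR()),
])

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(nested, X, y, cv=cv, scoring="r2")
print(f"nested 10-fold R2: mean = {scores.mean():.3f}, "
      f"std = {scores.std(ddof=1):.3f}")
```

Comparing these nested scores with those obtained from a feature set fixed in advance would show how much, if any, of the reported gain depends on selection performed outside the folds.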

Circularity Check

0 steps flagged

No circularity: empirical ablation results do not reduce to inputs by construction

The paper presents an empirical ML study reporting measured R2 values from ablation experiments on the California Housing dataset. The central claims (R2=0.723 after scaling + feature engineering + tuning, with incremental contributions of +0.744, +0.026, +0.008) are obtained via cross-validated pipelines and compared to an external prior result (Preethi et al. R2=0.60). No equations, derivations, or self-referential definitions exist that would make any reported performance equivalent to its inputs by construction. Feature selection and hyperparameter search are described as occurring within a leakage-safe Pipeline, and the 10-fold CV mean is presented as an independent robustness check. This is a standard experimental workflow whose outputs are falsifiable against held-out data and external baselines; no load-bearing step collapses to a tautology or self-citation chain.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on standard SVR assumptions and the validity of the cited baseline comparison; no new entities are introduced.

free parameters (1)
  • SVR hyperparameters (C, epsilon, kernel parameters)
    Selected via randomized search with 3-fold cross-validation on the training data
axioms (1)
  • domain assumption: The California Housing data distribution allows domain-motivated feature transformations without introducing leakage when placed inside a scikit-learn Pipeline
    Invoked in the feature engineering and preprocessing stages

pith-pipeline@v0.9.0 · 5662 in / 1344 out tokens · 42697 ms · 2026-05-12T01:36:28.754983+00:00 · methodology

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

  1. V. N. Vapnik, The Nature of Statistical Learning Theory, 2nd ed. New York, NY, USA: Springer-Verlag, 1997.
  2. H. Drucker, C. J. C. Burges, L. Kaufman, A. Smola, and V. Vapnik, “Support vector regression machines,” in Advances in Neural Information Processing Systems, vol. 9. Cambridge, MA, USA: MIT Press, 1997, pp. 155–161.
  3. A. J. Smola and B. Schölkopf, “A tutorial on support vector regression,” Statistics and Computing, vol. 14, no. 3, pp. 199–222, Aug. 2004.
  4. C.-W. Hsu, C.-C. Chang, and C.-J. Lin, “A practical guide to support vector classification,” Dept. Comput. Sci., Nat. Taiwan Univ., Taipei, Tech. Rep., 2003.
  5. F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  6. G. Preethi et al., “Comparative analysis of machine learning algorithms for California housing price prediction,” SN Computer Science, vol. 6, no. 1, 2025, doi: 10.1007/s42979-024-03578-7.
  7. R. K. Pace and R. Barry, “Sparse spatial autoregressions,” Statistics and Probability Letters, vol. 33, no. 3, pp. 291–297, May 1997.
  8. S. Rosen, “Hedonic prices and implicit markets: Product differentiation in pure competition,” Journal of Political Economy, vol. 82, no. 1, pp. 34–55, 1974.
  9. V. Limsombunchai, C. Gan, and M. Lee, “House price prediction: Hedonic price model vs. artificial neural network,” American Journal of Applied Sciences, vol. 1, no. 3, pp. 193–201, 2004.
  10. G. Z. Fan, S. E. Ong, and H. C. Koh, “Determinants of house price: A decision tree approach,” Urban Studies, vol. 43, no. 12, pp. 2301–2315, 2006.
  11. A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd ed. Sebastopol, CA, USA: O’Reilly Media, 2019.
  12. A. S. Fotheringham, C. Brunsdon, and M. Charlton, Geographically Weighted Regression. Chichester, UK: Wiley, 2002.
  13. G. Dong, R. Harris, and N. Jones, “Multilevel modelling of real estate price data,” Environment and Planning B, vol. 45, no. 6, pp. 1022–1041, 2018.
  14. B. Park and J. K. Bae, “Using machine learning algorithms for housing price prediction: The case of Fairfax County, Virginia housing data,” Expert Systems with Applications, vol. 42, no. 6, pp. 2928–2934, 2015.
  15. T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proc. 22nd ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, San Francisco, CA, USA, 2016, pp. 785–794.
  16. Y. Chen, X. He, and Z. Li, “Hybrid optimisation of support vector regression for time series forecasting,” Applied Soft Computing, vol. 91, 2020, Art. no. 106296.
  17. J. Yao, H. Zheng, and H. Jiang, “Feature selection for SVR regression using a filter-wrapper hybrid approach,” Neural Computing and Applications, vol. 33, pp. 7229–7241, 2021.
  18. L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, Oct. 2001.