arxiv: 1907.05618 · v1 · pith:GJZMNXKJnew · submitted 2019-07-12 · 💻 cs.DB

Detecting coherent explorations in SQL workloads

Pith reviewed 2026-05-24 22:24 UTC · model grok-4.3

classification 💻 cs.DB
keywords SQL workloads · coherent explorations · query features · ad-hoc queries · workload analysis · SQLShare · data exploration · sequence segmentation

The pith

Features from SQL queries can separate ad-hoc sequences into coherent explorations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to detect coherent explorations hidden inside collections of SQL queries, especially the ad-hoc, hand-written ones typical of scientists and data scientists. It extracts a set of features that describe individual queries and shows that these features suffice to partition sequences of queries into distinct, meaningful explorations. The method is tested on the SQLShare workload and additional query collections to confirm that the separation works in practice. A reader would care because it turns raw logs of database use into interpretable units of user intent without requiring manual review.

Core claim

Extracting features that characterize SQL queries allows sequences within a workload to be separated into meaningful explorations, as shown by applying the approach to the SQLShare collection of ad-hoc queries and validating the results on several other workloads.

What carries the argument

Features that characterize SQL queries, applied to group sequences into coherent explorations.
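
As a rough illustration of the idea, the sketch below extracts a few coarse features from each query and starts a new exploration whenever consecutive queries stop sharing tables. It is not the authors' method: the feature set, the regex-based parsing, the Jaccard overlap rule, the 0.5 threshold, and the names `query_features` and `segment` are all assumptions made for this example.

```python
import re

# Aggregate functions counted as an "aggregation pattern" (illustrative choice).
AGGREGATES = ("count", "sum", "avg", "min", "max")


def query_features(sql: str) -> dict:
    """Extract a few coarse features from one SQL string (naive regex parsing)."""
    text = sql.lower()
    # Table names that follow FROM or JOIN; subqueries and aliases are ignored.
    tables = set(re.findall(r"\b(?:from|join)\s+([a-z_][\w.]*)", text))
    # Text of the WHERE clause, up to GROUP BY, ORDER BY, or the end of the query.
    where = re.search(r"\bwhere\b(.*?)(?:\bgroup by\b|\border by\b|$)", text, re.S)
    n_predicates = 0
    if where:
        n_predicates = len(
            re.findall(r"<=|>=|<>|!=|=|<|>|\blike\b|\bin\b", where.group(1))
        )
    n_aggregates = sum(len(re.findall(rf"\b{name}\s*\(", text)) for name in AGGREGATES)
    return {"tables": tables, "n_predicates": n_predicates, "n_aggregates": n_aggregates}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def segment(queries: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Open a new exploration when table overlap with the previous query is low."""
    explorations: list[list[str]] = []
    previous = None
    for sql in queries:
        features = query_features(sql)
        if previous is None or jaccard(previous["tables"], features["tables"]) < threshold:
            explorations.append([])
        explorations[-1].append(sql)
        previous = features
    return explorations


workload = [
    "SELECT * FROM genes",
    "SELECT chrom, COUNT(*) FROM genes WHERE len > 100 GROUP BY chrom",
    "SELECT * FROM weather JOIN stations ON weather.sid = stations.id",
]
for number, exploration in enumerate(segment(workload), start=1):
    print(number, len(exploration))
```

A real workload would need a proper SQL parser and a richer similarity measure; the point here is only that per-query features can drive a break/no-break decision between consecutive queries.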

If this is right

  • Ad-hoc workloads can be automatically segmented into units that reflect actual user data analysis sessions.
  • Platform operators gain a concrete way to observe how non-expert users explore uploaded datasets.
  • The same feature extraction and separation process applies across multiple query collections beyond the original test set.
  • Exploration detection becomes a repeatable step in workload analysis pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Query recommendation systems could use the detected explorations to suggest the next logical step inside an ongoing session.
  • Resource allocation on shared database platforms might be tuned to the common patterns found inside coherent explorations.
  • The same separation technique could be tried on logs from other interactive analysis environments to test generality.

Load-bearing premise

The chosen features drawn from SQL queries are enough to tell coherent explorations apart from unrelated sequences.

What would settle it

A manual review by domain experts finds that sequences the features label as one exploration actually contain unrelated queries or split a single exploration across multiple groups.
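
One way such a review could be scored, sketched under assumptions: treat each position where a new exploration starts as a boundary, then measure how many predicted boundaries the experts agree with. The per-query label encoding and the helper names `boundaries` and `boundary_scores` are hypothetical, and the label sequences are made-up toy data, not results from the paper.

```python
def boundaries(labels: list[int]) -> set[int]:
    """Positions i where query i starts a new exploration (its label differs from i-1)."""
    return {i for i in range(1, len(labels)) if labels[i] != labels[i - 1]}


def boundary_scores(predicted: set[int], expert: set[int]) -> tuple[float, float]:
    """Precision and recall of predicted boundaries against expert boundaries."""
    hits = len(predicted & expert)
    # Convention assumed here: an empty set yields a score of 1.0.
    precision = hits / len(predicted) if predicted else 1.0
    recall = hits / len(expert) if expert else 1.0
    return precision, recall


# Toy labels: one exploration id per query, in workload order.
predicted_labels = [0, 0, 1, 1, 1, 2]
expert_labels = [0, 0, 0, 1, 1, 2]

precision, recall = boundary_scores(boundaries(predicted_labels), boundaries(expert_labels))
print(precision, recall)
```

Low precision would correspond to a single exploration being split across groups; low recall would correspond to unrelated queries being merged into one exploration.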

Figures

Figures reproduced from arXiv: 1907.05618 by Aboubakar Sidikhy Diakhaby, Patrick Marcel, Veronika Peralta, Willeme Verdeaux.

Figure 1: Value distribution of main query features in the SQLShare dataset.
Figure 2: Feature correlation in datasets Open (up left), Enterprise (up right), SQLShare
Figure 3: Comparison of similarity indexes for 3 sessions.
Original abstract

This paper presents a proposal aiming at better understanding a workload of SQL queries and detecting coherent explorations hidden within the workload. In particular, our work investigates SQLShare [11], a database-as-a-service platform targeting scientists and data scientists with minimal database experience, whose workload was made available to the research community. According to the authors of [11], this workload is the only one containing primarily ad-hoc hand-written queries over user-uploaded datasets. We analyzed this workload by extracting features that characterize SQL queries and we show how to use these features to separate sequences of SQL queries into meaningful explorations. We ran several tests over various query workloads to validate empirically our approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes extracting features from SQL queries to characterize them and then using these features to separate sequences of queries within workloads (particularly the ad-hoc SQLShare dataset) into coherent, meaningful explorations. It reports empirical validation across multiple workloads to support the approach.

Significance. If the features can be shown to separate explorations in a non-circular manner with external validation, the work would offer a practical contribution to workload analysis in data science platforms, aiding tasks such as query recommendation and resource management for ad-hoc users.

major comments (2)
  1. [Validation experiments] Validation section (empirical tests on SQLShare and other workloads): the tests rely on the same feature-based separation to define what counts as a 'meaningful exploration,' without an independent ground truth (e.g., human labels of user intent, task-success metrics, or dataset-change logs) that could falsify the sufficiency claim if clusters fail to align with actual explorations.
  2. [Feature extraction and separation] Feature selection and separation method: no description is provided of how the chosen query features were selected or how separation quality was quantified (e.g., no metrics, baselines, or statistical tests), making it impossible to assess whether the data support the central claim that the features suffice to distinguish coherent sequences.
minor comments (1)
  1. [Abstract and Introduction] The abstract and introduction would benefit from explicit definitions of 'coherent exploration' and 'meaningful' to avoid ambiguity in the claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to strengthen the presentation of our validation and methods.

  1. Referee: [Validation experiments] Validation section (empirical tests on SQLShare and other workloads): the tests rely on the same feature-based separation to define what counts as a 'meaningful exploration,' without an independent ground truth (e.g., human labels of user intent, task-success metrics, or dataset-change logs) that could falsify the sufficiency claim if clusters fail to align with actual explorations.

    Authors: We agree that the validation on SQLShare is largely internal to the feature-based clustering, as the public dataset lacks explicit labels for user explorations. Our tests on additional workloads were intended to provide supporting evidence through observable coherence in query sequences, but we acknowledge this falls short of fully independent ground truth. In the revised manuscript we will explicitly discuss this limitation, add quantitative cluster validity metrics (e.g., silhouette scores), and include comparisons against alternative separation methods to allow readers to assess the approach more rigorously. revision: yes

  2. Referee: [Feature extraction and separation] Feature selection and separation method: no description is provided of how the chosen query features were selected or how separation quality was quantified (e.g., no metrics, baselines, or statistical tests), making it impossible to assess whether the data support the central claim that the features suffice to distinguish coherent sequences.

    Authors: We accept the referee's observation that the original manuscript lacks sufficient detail on feature selection and quality quantification. The features were chosen to reflect SQL elements relevant to iterative data exploration (e.g., table references, predicates, and aggregation patterns), but this rationale and any supporting analysis were not adequately documented. In revision we will add a dedicated subsection describing the feature selection process, report separation quality using standard metrics and baselines (such as random feature sets), and include appropriate statistical tests where applicable. revision: yes
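
The silhouette score named in the first response can be sketched as follows, assuming each query has already been mapped to a numeric feature vector and assigned to an exploration. The function name `silhouette`, the Euclidean distance, and the toy points are assumptions for illustration; nothing here reproduces an experiment from the paper.

```python
from math import dist
from statistics import mean


def silhouette(points: list[tuple[float, ...]], labels: list[int]) -> float:
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters: dict[int, list[tuple[float, ...]]] = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # A singleton cluster gets a score of 0 by convention.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's distance to itself is 0, so divide by n - 1).
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            mean(dist(point, other) for other in members)
            for key, members in clusters.items()
            if key != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator > 0 else 0.0)
    return mean(scores)


# Toy feature vectors forming two well-separated groups.
vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(silhouette(vectors, [0, 0, 1, 1]))  # grouping that matches the geometry
print(silhouette(vectors, [0, 1, 0, 1]))  # grouping that cuts across it
```

A score near 1 only says the groups are compact and well separated in feature space. It remains an internal measure, so it does not by itself answer the referee's request for independent ground truth.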

Circularity Check

0 steps flagged

Empirical feature extraction with external workload validation; no derivation chain

The paper describes extracting features from SQL queries to separate sequences into explorations, then validates the approach empirically across multiple workloads including the externally sourced SQLShare dataset. No equations, fitted parameters, self-definitional reductions, or load-bearing self-citations appear in the abstract or described method. The central claim rests on empirical tests rather than any closed-form reduction to prior quantities or inputs by construction, satisfying the criteria for a self-contained empirical study against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5637 in / 928 out tokens · 19825 ms · 2026-05-24T22:24:28.040655+00:00 · methodology

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

  1. [1]

    G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations, 6(1):20–29, 2004

  2. [2]

    B. D. Bhattarai, M. Wong, and R. Singh. Discovering user information goals with semantic website media modeling. In MMM (1), volume 4351 of Lecture Notes in Computer Science, pages 364–375. Springer, 2007

  3. [3]

    S. Chaudhuri and V. R. Narasayya. Self-tuning database systems: A decade of progress. In Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007, pages 3–14, 2007

  4. [4]

    M. Djedaini, K. Drushku, N. Labroche, P. Marcel, V. Peralta, and W. Verdeaux. Automatic assessment of interactive OLAP explorations. Inf. Syst., 82:148–163, 2019

  5. [5]

    M. Djedaini, N. Labroche, P. Marcel, and V. Peralta. Detecting user focus in OLAP analyses. In Advances in Databases and Information Systems - 21st European Conference, ADBIS 2017, Nicosia, Cyprus, September 24-27, 2017, Proceedings, pages 105–119, 2017

  6. [6]

    K. Drushku, N. Labroche, P. Marcel, and V. Peralta. Interest-based recommendations for business intelligence users. To appear in Information Systems, 2019. https://doi.org/10.1016/j.is.2018.08.004

  7. [7]

    M. Eirinaki, S. Abraham, N. Polyzotis, and N. Shaikh. Querie: Collaborative database exploration. IEEE Trans. Knowl. Data Eng., 26(7):1778–1790, 2014

  8. [8]

    M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. Efficient and robust automated machine learning. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2962–2970. Curran Associates, Inc., 2015

  9. [9]

    J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, pages 601–608, 2006

  10. [10]

    S. Idreos, O. Papaemmanouil, and S. Chaudhuri. Overview of data exploration techniques. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pages 277–281, 2015

  11. [11]

    S. Jain, D. Moritz, D. Halperin, B. Howe, and E. Lazowska. Sqlshare: Results from a multi-year sql-as-a-service experiment. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 281–293, 2016

  12. [12]

    N. Khoussainova, Y. Kwon, M. Balazinska, and D. Suciu. Snipsuggest: Context-aware autocompletion for SQL. PVLDB, 4(1):22–33, 2010

  13. [13]

    G. Kul, D. T. A. Luong, T. Xie, V. Chandola, O. Kennedy, and S. J. Upadhyaya. Similarity metrics for SQL query clustering. IEEE Trans. Knowl. Data Eng., 30(12):2408–2420, 2018

  14. [14]

    H. V. Nguyen, K. Böhm, F. Becker, B. Goldman, G. Hinkel, and E. Müller. Identifying user interests within the data space - a case study with skyserver. In EDBT, pages 641–652. OpenProceedings.org, 2015

  15. [15]

    S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Trans. Knowl. Data Eng., 22(10):1345–1359, 2010

  16. [16]

    V. Peralta, W. Verdeaux, Y. Raimont, and P. Marcel. Qualitative analysis of the sqlshare workload for session segmentation. In Proceedings of the 21st International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, co-located with EDBT/ICDT Joint Conference, DOLAP@EDBT/ICDT 2019, Lisbon, Portugal, March 26, 2019, 2019

  17. [17]

    A. Ratner, S. H. Bach, H. R. Ehrenberg, J. A. Fries, S. Wu, and C. Ré. Snorkel: Rapid training data creation with weak supervision. PVLDB, 11(3):269–282, 2017

  18. [18]

    O. Romero, P. Marcel, A. Abelló, V. Peralta, and L. Bellatreche. Describing analytical sessions using a multidimensional algebra. In Data Warehousing and Knowledge Discovery - 13th International Conference, DaWaK 2011, Toulouse, France, August 29-September 2, 2011. Proceedings, pages 224–239, 2011

  19. [19]

    S. Salvador and P. Chan. Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. In 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2004), 15-17 November 2004, Boca Raton, FL, USA, pages 576–584, 2004

  20. [20]

    V. Satopaa, J. R. Albrecht, D. E. Irwin, and B. Raghavan. Finding a “kneedle” in a haystack: Detecting knee points in system behavior. In 31st IEEE International Conference on Distributed Computing Systems Workshops (ICDCS 2011 Workshops), 20-24 June 2011, Minneapolis, Minnesota, USA, pages 166–171, 2011

  21. [21]

    V. Singh, J. Gray, A. Thakar, A. S. Szalay, J. Raddick, B. Boroski, S. Lebedeva, and B. Yanny. Skyserver traffic report - the first five years. Technical report, December 2006

  22. [22]

    M. Sugiyama, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada,...

  23. [23]

    H. van den Brink, R. van der Leek, and J. Visser. Quality assessment for embedded SQL. In SCAM, pages 163–170. IEEE Computer Society, 2007

  24. [24]

    A. Vashistha and S. Jain. Measuring query complexity in sqlshare workload. https://uwescience.github.io/sqlshare/pdfs/Jain-Vashistha.pdf

  25. [25]

    R. W. White. Interactions with Search Systems. Cambridge University Press, 2016

  26. [26]

    M. Wong, B. Bhattarai, and R. Singh. Characterization and analysis of usage patterns in large multimedia websites. Technical report, 2006