Detecting coherent explorations in SQL workloads

Aboubakar Sidikhy Diakhaby; Patrick Marcel; Veronika Peralta; Willeme Verdeaux

arxiv: 1907.05618 · v1 · pith:GJZMNXKJnew · submitted 2019-07-12 · 💻 cs.DB

Veronika Peralta, Patrick Marcel, Willeme Verdeaux, Aboubakar Sidikhy Diakhaby

Pith reviewed 2026-05-24 22:24 UTC · model grok-4.3

classification 💻 cs.DB

keywords SQL workloads · coherent explorations · query features · ad-hoc queries · workload analysis · SQLShare · data exploration · sequence segmentation

The pith

Features from SQL queries can separate ad-hoc sequences into coherent explorations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to detect coherent explorations hidden inside collections of SQL queries, especially the ad-hoc, hand-written ones typical of scientists and data scientists. It extracts a set of features that describe individual queries and shows that these features suffice to partition sequences of queries into distinct, meaningful explorations. The method is tested on the SQLShare workload and additional query collections to confirm that the separation works in practice. A reader would care because it turns raw logs of database use into interpretable units of user intent without requiring manual review.

Core claim

Extracting features that characterize SQL queries allows sequences within a workload to be separated into meaningful explorations, as shown by applying the approach to the SQLShare collection of ad-hoc queries and validating the results on several other workloads.

What carries the argument

Features that characterize SQL queries, applied to group sequences into coherent explorations.
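
To make the idea concrete, here is a minimal sketch of feature-based segmentation. It is not the paper's method: the feature set (referenced tables, a rough predicate count, an aggregate count), the regex-based parsing, the Jaccard comparison of table sets, and the `threshold` value are all assumptions chosen for illustration.

```python
import re

# Hypothetical feature set for illustration; not the paper's actual features.
AGGREGATES = ("count", "sum", "avg", "min", "max")

def extract_features(sql: str) -> dict:
    """Return a few coarse features of one SQL query using regexes."""
    text = sql.lower()
    # Tables: identifiers that directly follow FROM or JOIN (subqueries are skipped).
    tables = set(re.findall(r"\b(?:from|join)\s+([a-z_][\w.]*)", text))
    # Predicates: rough count, 1 for WHERE plus one per AND/OR anywhere in the query.
    n_predicates = 0
    if re.search(r"\bwhere\b", text):
        n_predicates = 1 + len(re.findall(r"\b(?:and|or)\b", text))
    # Aggregates: calls such as COUNT(...) or SUM(...).
    n_aggregates = sum(len(re.findall(rf"\b{a}\s*\(", text)) for a in AGGREGATES)
    return {"tables": tables, "n_predicates": n_predicates, "n_aggregates": n_aggregates}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def segment(queries: list, threshold: float = 0.5) -> list:
    """Start a new exploration whenever consecutive queries share too few tables."""
    explorations = []
    previous = None
    for q in queries:
        feats = extract_features(q)
        if previous is None or jaccard(previous["tables"], feats["tables"]) < threshold:
            explorations.append([])
        explorations[-1].append(q)
        previous = feats
    return explorations
```

Calling `segment` on a list of query strings returns a list of lists, one per detected exploration. A real pipeline would presumably use a proper SQL parser and a richer similarity measure; the regexes here only handle simple queries.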

If this is right

Ad-hoc workloads can be automatically segmented into units that reflect actual user data analysis sessions.
Platform operators gain a concrete way to observe how non-expert users explore uploaded datasets.
The same feature extraction and separation process applies across multiple query collections beyond the original test set.
Exploration detection becomes a repeatable step in workload analysis pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Query recommendation systems could use the detected explorations to suggest the next logical step inside an ongoing session.
Resource allocation on shared database platforms might be tuned to the common patterns found inside coherent explorations.
The same separation technique could be tried on logs from other interactive analysis environments to test generality.

Load-bearing premise

The chosen features drawn from SQL queries are enough to tell coherent explorations apart from unrelated sequences.

What would settle it

A manual review by domain experts finds that sequences the features label as one exploration actually contain unrelated queries or split a single exploration across multiple groups.
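
One way such a review could be scored, sketched under assumptions: each query in a sequence gets an exploration label from the method and another from an expert, and the two segmentations are compared on where they place boundaries. The function names and the precision/recall framing are illustrative choices, not something the paper specifies.

```python
def boundaries(labels):
    """Positions i >= 1 where query i starts a new exploration."""
    return {i for i in range(1, len(labels)) if labels[i] != labels[i - 1]}

def boundary_agreement(predicted, expert):
    """Precision and recall of predicted exploration boundaries against expert ones."""
    if len(predicted) != len(expert):
        raise ValueError("label sequences must have the same length")
    p, e = boundaries(predicted), boundaries(expert)
    hits = len(p & e)
    # With no boundaries on one side, treat the corresponding ratio as perfect.
    precision = hits / len(p) if p else 1.0
    recall = hits / len(e) if e else 1.0
    return precision, recall
```

Low precision would mean the features split explorations that experts consider single units; low recall would mean unrelated queries are being merged into one exploration.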

Figures

Figures reproduced from arXiv: 1907.05618 by Aboubakar Sidikhy Diakhaby, Patrick Marcel, Veronika Peralta, Willeme Verdeaux.

**Figure 2.** Feature correlation in datasets Open (up left), Enterprise (up right), SQLShare

**Figure 3.** Comparison of similarity indexes for 3 sessions.

Original abstract

This paper presents a proposal aiming at better understanding a workload of SQL queries and detecting coherent explorations hidden within the workload. In particular, our work investigates SQLShare [11], a database-as-a-service platform targeting scientists and data scientists with minimal database experience, whose workload was made available to the research community. According to the authors of [11], this workload is the only one containing primarily ad-hoc hand-written queries over user-uploaded datasets. We analyzed this workload by extracting features that characterize SQL queries and we show how to use these features to separate sequences of SQL queries into meaningful explorations. We ran several tests over various query workloads to validate empirically our approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper applies feature extraction to detect coherent SQL explorations but relies on unvalidated clustering.

This paper takes standard SQL query features and applies them to split sequences in workloads like SQLShare into coherent explorations. The main takeaway is that while it targets an interesting dataset, the approach lacks an external check on whether the resulting groups are actually meaningful to users.

It does a decent job highlighting SQLShare as a source of ad-hoc queries from scientists, which is rarer than the usual benchmark workloads. Running the method on multiple workloads gives some sense of how it behaves in practice.

The soft spot is the validation. Without human judgments, task completion data, or other ground truth to test against, it's difficult to tell if the features truly capture coherent explorations or just produce some clusters. The paper presents this as empirical validation, but the separation quality isn't tied to an independent criterion that could disprove the claim.

Readers working on database interfaces or query recommendation systems could find the feature list and grouping idea useful as a starting point. It won't change the field, but it adds a data point on workload analysis for non-expert users. I would send this to peer review. The core idea is straightforward enough that referees can evaluate the specifics and suggest improvements to the evaluation.

Referee Report

2 major / 1 minor

Summary. The paper proposes extracting features from SQL queries to characterize them and then using these features to separate sequences of queries within workloads (particularly the ad-hoc SQLShare dataset) into coherent, meaningful explorations. It reports empirical validation across multiple workloads to support the approach.

Significance. If the features can be shown to separate explorations in a non-circular manner with external validation, the work would offer a practical contribution to workload analysis in data science platforms, aiding tasks such as query recommendation and resource management for ad-hoc users.

major comments (2)

[Validation experiments] Validation section (empirical tests on SQLShare and other workloads): the tests rely on the same feature-based separation to define what counts as a 'meaningful exploration,' without an independent ground truth (e.g., human labels of user intent, task-success metrics, or dataset-change logs) that could falsify the sufficiency claim if clusters fail to align with actual explorations.
[Feature extraction and separation] Feature selection and separation method: no description is provided of how the chosen query features were selected or how separation quality was quantified (e.g., no metrics, baselines, or statistical tests), making it impossible to assess whether the data support the central claim that the features suffice to distinguish coherent sequences.

minor comments (1)

[Abstract and Introduction] The abstract and introduction would benefit from explicit definitions of 'coherent exploration' and 'meaningful' to avoid ambiguity in the claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to strengthen the presentation of our validation and methods.

Referee: [Validation experiments] Validation section (empirical tests on SQLShare and other workloads): the tests rely on the same feature-based separation to define what counts as a 'meaningful exploration,' without an independent ground truth (e.g., human labels of user intent, task-success metrics, or dataset-change logs) that could falsify the sufficiency claim if clusters fail to align with actual explorations.

Authors: We agree that the validation on SQLShare is largely internal to the feature-based clustering, as the public dataset lacks explicit labels for user explorations. Our tests on additional workloads were intended to provide supporting evidence through observable coherence in query sequences, but we acknowledge this falls short of fully independent ground truth. In the revised manuscript we will explicitly discuss this limitation, add quantitative cluster validity metrics (e.g., silhouette scores), and include comparisons against alternative separation methods to allow readers to assess the approach more rigorously.

revision: yes

Referee: [Feature extraction and separation] Feature selection and separation method: no description is provided of how the chosen query features were selected or how separation quality was quantified (e.g., no metrics, baselines, or statistical tests), making it impossible to assess whether the data support the central claim that the features suffice to distinguish coherent sequences.

Authors: We accept the referee's observation that the original manuscript lacks sufficient detail on feature selection and quality quantification. The features were chosen to reflect SQL elements relevant to iterative data exploration (e.g., table references, predicates, and aggregation patterns), but this rationale and any supporting analysis were not adequately documented. In revision we will add a dedicated subsection describing the feature selection process, report separation quality using standard metrics and baselines (such as random feature sets), and include appropriate statistical tests where applicable.

revision: yes
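
For readers unfamiliar with the silhouette score mentioned in the rebuttal, the following is a small pure-Python sketch of the standard definition. It assumes each query has already been mapped to a numeric feature vector and assigned an exploration label; the Euclidean distance and the function name are illustrative choices, and nothing here reflects how the authors would actually compute it.

```python
from math import dist

def silhouette_score(points, labels):
    """Mean silhouette coefficient, using Euclidean distance.

    s(i) = (b - a) / max(a, b), where a is the mean distance from point i to
    the other members of its cluster and b is the smallest mean distance to
    the members of any other cluster. Singleton clusters score 0 by convention.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

Scores close to 1 indicate compact, well-separated groups, while scores near or below 0 indicate overlapping groups. Note that this remains an internal measure: a high score shows the clusters are geometrically clean, not that they match user intent, which is the referee's underlying concern.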

Circularity Check

0 steps flagged

Empirical feature extraction with external workload validation; no derivation chain

The paper describes extracting features from SQL queries to separate sequences into explorations, then validates the approach empirically across multiple workloads including the externally sourced SQLShare dataset. No equations, fitted parameters, self-definitional reductions, or load-bearing self-citations appear in the abstract or described method. The central claim rests on empirical tests rather than any closed-form reduction to prior quantities or inputs by construction, satisfying the criteria for a self-contained empirical study against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

[1] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations, 6(1):20–29, 2004.

[2] B. D. Bhattarai, M. Wong, and R. Singh. Discovering user information goals with semantic website media modeling. In MMM (1), volume 4351 of Lecture Notes in Computer Science, pages 364–375. Springer, 2007.

[3] S. Chaudhuri and V. R. Narasayya. Self-tuning database systems: A decade of progress. In Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007, pages 3–14, 2007.

[4] M. Djedaini, K. Drushku, N. Labroche, P. Marcel, V. Peralta, and W. Verdeaux. Automatic assessment of interactive OLAP explorations. Inf. Syst., 82:148–163, 2019.

[5] M. Djedaini, N. Labroche, P. Marcel, and V. Peralta. Detecting user focus in OLAP analyses. In Advances in Databases and Information Systems - 21st European Conference, ADBIS 2017, Nicosia, Cyprus, September 24-27, 2017, Proceedings, pages 105–119, 2017.

[6] K. Drushku, N. Labroche, P. Marcel, and V. Peralta. Interest-based recommendations for business intelligence users. To appear in Information Systems, 2019. https://doi.org/10.1016/j.is.2018.08.004

[7] M. Eirinaki, S. Abraham, N. Polyzotis, and N. Shaikh. Querie: Collaborative database exploration. IEEE Trans. Knowl. Data Eng., 26(7):1778–1790, 2014.

[8] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. Efficient and robust automated machine learning. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2962–2970. Curran Associates, Inc., 2015.

[9] J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, pages 601–608, 2006.

[10] S. Idreos, O. Papaemmanouil, and S. Chaudhuri. Overview of data exploration techniques. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pages 277–281, 2015.

[11] S. Jain, D. Moritz, D. Halperin, B. Howe, and E. Lazowska. Sqlshare: Results from a multi-year sql-as-a-service experiment. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 281–293, 2016.

[12] N. Khoussainova, Y. Kwon, M. Balazinska, and D. Suciu. Snipsuggest: Context-aware autocompletion for SQL. PVLDB, 4(1):22–33, 2010.

[13] G. Kul, D. T. A. Luong, T. Xie, V. Chandola, O. Kennedy, and S. J. Upadhyaya. Similarity metrics for SQL query clustering. IEEE Trans. Knowl. Data Eng., 30(12):2408–2420, 2018.

[14] H. V. Nguyen, K. Böhm, F. Becker, B. Goldman, G. Hinkel, and E. Müller. Identifying user interests within the data space - a case study with skyserver. In EDBT, pages 641–652. OpenProceedings.org, 2015.

[15] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Trans. Knowl. Data Eng., 22(10):1345–1359, 2010.

[16] V. Peralta, W. Verdeaux, Y. Raimont, and P. Marcel. Qualitative analysis of the sqlshare workload for session segmentation. In Proceedings of the 21st International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, co-located with EDBT/ICDT Joint Conference, DOLAP@EDBT/ICDT 2019, Lisbon, Portugal, March 26, 2019.

[17] A. Ratner, S. H. Bach, H. R. Ehrenberg, J. A. Fries, S. Wu, and C. Ré. Snorkel: Rapid training data creation with weak supervision. PVLDB, 11(3):269–282, 2017.

[18] O. Romero, P. Marcel, A. Abelló, V. Peralta, and L. Bellatreche. Describing analytical sessions using a multidimensional algebra. In Data Warehousing and Knowledge Discovery - 13th International Conference, DaWaK 2011, Toulouse, France, August 29-September 2, 2011, Proceedings, pages 224–239, 2011.

[19] S. Salvador and P. Chan. Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. In 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2004), 15-17 November 2004, Boca Raton, FL, USA, pages 576–584, 2004.

[20] V. Satopaa, J. R. Albrecht, D. E. Irwin, and B. Raghavan. Finding a "kneedle" in a haystack: Detecting knee points in system behavior. In 31st IEEE International Conference on Distributed Computing Systems Workshops (ICDCS 2011 Workshops), 20-24 June 2011, Minneapolis, Minnesota, USA, pages 166–171, 2011.

[21] V. Singh, J. Gray, A. Thakar, A. S. Szalay, J. Raddick, B. Boroski, S. Lebedeva, and B. Yanny. Skyserver traffic report - the first five years. Technical report, December 2006.

[22] M. Sugiyama, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada,...

[23] H. van den Brink, R. van der Leek, and J. Visser. Quality assessment for embedded SQL. In SCAM, pages 163–170. IEEE Computer Society, 2007.

[24] A. Vashistha and S. Jain. Measuring query complexity in sqlshare workload. https://uwescience.github.io/sqlshare/pdfs/Jain-Vashistha.pdf

[25] R. W. White. Interactions with Search Systems. Cambridge University Press, 2016.

[26] M. Wong, B. Bhattarai, and R. Singh. Characterization and analysis of usage patterns in large multimedia websites. Technical report, 2006.
