Discriminability Tests for Visualization Effectiveness and Scalability
Pith reviewed 2026-05-24 15:48 UTC · model grok-4.3
The pith
MS-SSIM image similarity scores can approximate human judgments of how discriminable visualizations are across different datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the Multi-Scale Structural Similarity Index applied to rendered visualization images captures both low-level and high-level differences, and its scores align with human similarity judgments and empirical effectiveness measures in the tested cases, thereby providing a way to evaluate and select visualizations based on their discriminability.
What carries the argument
The Multi-Scale Structural Similarity Index (MS-SSIM) applied to pairs of rendered visualization images to quantify how much the visual output changes with data changes.
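As a rough illustration of what scoring a pair of rendered images at several scales involves, here is a minimal pure-Python sketch. It is a deliberate simplification and not the paper's implementation: it uses whole-image statistics at each scale instead of sliding Gaussian windows, a 2x2 box average instead of a proper low-pass filter, and it clamps negative contrast-structure terms to zero so that fractional exponents stay real. The names `ms_ssim_global`, `_stats`, and `_halve` are hypothetical. The per-scale exponents are the five values commonly quoted from Wang et al. [42].

```python
from typing import List

Image = List[List[float]]  # grayscale pixel rows, values in [0, 255]

# Per-scale exponents commonly quoted from Wang et al. for five scales.
WEIGHTS = (0.0448, 0.2856, 0.3001, 0.2363, 0.1333)
C1 = (0.01 * 255) ** 2  # stabilizes the luminance term
C2 = (0.03 * 255) ** 2  # stabilizes the contrast-structure term


def _stats(x: Image, y: Image):
    """Return both means, both variances, and the covariance over all pixels."""
    xs = [v for row in x for v in row]
    ys = [v for row in y for v in row]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = sum((a - mx) ** 2 for a in xs) / n
    vy = sum((b - my) ** 2 for b in ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
    return mx, my, vx, vy, cov


def _halve(img: Image) -> Image:
    """Downsample by averaging non-overlapping 2x2 blocks (assumes even sizes)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [
        [
            (img[2 * r][2 * c] + img[2 * r][2 * c + 1]
             + img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(w)
        ]
        for r in range(h)
    ]


def ms_ssim_global(x: Image, y: Image) -> float:
    """Simplified multi-scale SSIM-style score for two equally sized images.

    Assumes width and height are divisible by 16 so that four halvings work.
    """
    score = 1.0
    for j, weight in enumerate(WEIGHTS):
        mx, my, vx, vy, cov = _stats(x, y)
        # Combined contrast-structure term (the usual C3 = C2 / 2 shortcut).
        cs = (2 * cov + C2) / (vx + vy + C2)
        score *= max(cs, 0.0) ** weight  # clamping is an assumption of this sketch
        if j == len(WEIGHTS) - 1:
            # Luminance is only compared at the coarsest scale.
            lum = (2 * mx * my + C1) / (mx * mx + my * my + C1)
            score *= lum ** weight
        else:
            x, y = _halve(x), _halve(y)
    return score
```

Identical images score 1 and any pixel difference pushes the score below 1. For real evaluations an established windowed MS-SSIM implementation would be the safer choice, since the windowing is what makes the measure sensitive to local structure.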
Load-bearing premise
That MS-SSIM scores on rendered visualization images will continue to track human discriminability judgments when applied to new chart types, new data distributions, or different rendering parameters beyond the two studies described.
What would settle it
A new study on a different visualization type, such as parallel coordinates or treemaps, where human similarity judgments diverge substantially from MS-SSIM scores on the same image pairs.
Original abstract
The scalability of a particular visualization approach is limited by the ability for people to discern differences between plots made with different datasets. Ideally, when the data changes, the visualization changes in perceptible ways. This relation breaks down when there is a mismatch between the encoding and the character of the dataset being viewed. Unfortunately, visualizations are often designed and evaluated without fully exploring how they will respond to a wide variety of datasets. We explore the use of an image similarity measure, the Multi-Scale Structural Similarity Index (MS-SSIM), for testing the discriminability of a data visualization across a variety of datasets. MS-SSIM is able to capture the similarity of two visualizations across multiple scales, including low level granular changes and high level patterns. Significant data changes that are not captured by the MS-SSIM indicate visualizations of low discriminability and effectiveness. The measure's utility is demonstrated with two empirical studies. In the first, we compare human similarity judgments and MS-SSIM scores for a collection of scatterplots. In the second, we compute the discriminability values for a set of basic visualizations and compare them with empirical measurements of effectiveness. In both cases, the analyses show that the computational measure is able to approximate empirical results. Our approach can be used to rank competing encodings on their discriminability and to aid in selecting visualizations for a particular type of data distribution.
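The ranking idea in the abstract can be sketched as follows. This page does not state how the paper aggregates pairwise scores into a single discriminability value, so the mean pairwise dissimilarity (1 minus similarity) used here is an assumption, and `discriminability`, `rank_encodings`, and the toy `pixel_agreement` similarity are hypothetical names. In practice the `similarity` argument would be an MS-SSIM function applied to rendered images.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple


def discriminability(images: Sequence, similarity: Callable) -> float:
    """Mean of (1 - similarity) over all pairs of renders of one encoding."""
    pairs = list(combinations(images, 2))
    if not pairs:
        raise ValueError("need at least two rendered images")
    return sum(1.0 - similarity(a, b) for a, b in pairs) / len(pairs)


def rank_encodings(
    renders: Dict[str, Sequence], similarity: Callable
) -> List[Tuple[str, float]]:
    """Sort encodings from most to least discriminable."""
    scores = {
        name: discriminability(images, similarity)
        for name, images in renders.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def pixel_agreement(a: Sequence, b: Sequence) -> float:
    """Toy stand-in similarity: fraction of positions with equal pixel values."""
    return sum(p == q for p, q in zip(a, b)) / len(a)


# Hypothetical flattened 2x2 "renders" of three datasets under two encodings.
renders = {
    "encoding_a": [(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 1)],
    "encoding_b": [(0, 1, 0, 1), (1, 0, 1, 0), (1, 1, 0, 0)],
}
ranking = rank_encodings(renders, pixel_agreement)
```

Here `encoding_b` ranks first because its renders change more from one dataset to the next. Passing the similarity function in as an argument keeps the aggregation step independent of whichever image metric is used.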
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that the Multi-Scale Structural Similarity Index (MS-SSIM) serves as a computational proxy for human discriminability in visualizations, capturing both low-level and high-level changes across datasets. This is supported by two studies: (1) comparison of MS-SSIM scores against human similarity judgments on scatterplots, and (2) computation of discriminability values for basic visualizations compared against prior empirical effectiveness measures. The authors conclude that the measure approximates empirical results and can rank encodings or guide visualization selection for given data distributions.
Significance. If the central approximation claim holds under broader conditions, the work would provide a scalable, automated method for evaluating visualization discriminability without repeated user studies, which is a practical strength for the visualization community. The use of an established image metric (MS-SSIM) with fixed parameters that are not fitted to the study data, and the explicit comparison to human data, are positive features. However, the narrow scope of validation limits the immediate significance.
major comments (3)
- [Abstract / Study 1] Abstract and Study 1 description: the claim that MS-SSIM 'approximates empirical results' is central but rests on an unspecified collection of scatterplots, human judgment protocol, number of participants, and correlation statistics (e.g., Pearson r, p-values, effect sizes). Without these, the strength of the approximation cannot be assessed.
- [Study 2] Study 2: the comparison of computed discriminability values to 'empirical measurements of effectiveness' is load-bearing for the ranking claim, yet the manuscript provides no information on which prior effectiveness studies were used, how visualizations were rendered (point size, color maps, aspect ratio), or the exact matching procedure between MS-SSIM scores and effectiveness rankings.
- [Discussion / Conclusion] Generalization paragraph (end of abstract and discussion): the assertion that the approach 'can be used to rank competing encodings' for 'a particular type of data distribution' assumes transfer beyond the two tested regimes (scatterplots; basic chart types). No cross-validation on new chart families, data distribution families, or rendering parameters is reported, making the broader utility claim unsupported by the presented evidence.
minor comments (2)
- [Introduction] Notation: MS-SSIM is introduced without an equation or reference to the original Wang et al. formulation; a brief definition or citation would improve clarity.
- [Figures] Figure captions: the scatterplot examples and effectiveness comparison plots lack axis labels or legends indicating the exact data distributions or encoding parameters used.
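For the notation point above, the formulation usually attributed to Wang, Simoncelli, and Bovik [42] can be written as follows. This is a recap from the general literature rather than an equation quoted from the manuscript under review. Here x and y are the two images, M is the number of scales, and the exponents weight each scale.

```latex
\mathrm{MS\text{-}SSIM}(x, y)
  = \left[ l_M(x, y) \right]^{\alpha_M}
    \prod_{j=1}^{M} \left[ c_j(x, y) \right]^{\beta_j}
                    \left[ s_j(x, y) \right]^{\gamma_j}

l(x, y) = \frac{2 \mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \qquad
c(x, y) = \frac{2 \sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \qquad
s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}
```

The luminance term l is evaluated only at the coarsest scale M, while the contrast term c and the structure term s are evaluated at every scale j after repeated low-pass filtering and downsampling by two. The constants C_1, C_2, and C_3 stabilize the ratios when the denominators are close to zero.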
Simulated Author's Rebuttal
We thank the referee for the constructive comments. The feedback identifies important areas where additional detail and qualification are needed to strengthen the manuscript. We address each major comment below and will revise the paper accordingly.
-
Referee: [Abstract / Study 1] Abstract and Study 1 description: the claim that MS-SSIM 'approximates empirical results' is central but rests on an unspecified collection of scatterplots, human judgment protocol, number of participants, and correlation statistics (e.g., Pearson r, p-values, effect sizes). Without these, the strength of the approximation cannot be assessed.
Authors: We agree that the abstract is too concise on these points and that the manuscript should make the supporting details more accessible. In the revision we will expand the abstract to summarize the scatterplot collection size, human judgment protocol, participant count, and key correlation statistics, and we will add explicit cross-references to the full methodological description in the Study 1 section. revision: yes
-
Referee: [Study 2] Study 2: the comparison of computed discriminability values to 'empirical measurements of effectiveness' is load-bearing for the ranking claim, yet the manuscript provides no information on which prior effectiveness studies were used, how visualizations were rendered (point size, color maps, aspect ratio), or the exact matching procedure between MS-SSIM scores and effectiveness rankings.
Authors: We acknowledge that these implementation details are currently underspecified. The revised manuscript will include a dedicated subsection describing the referenced prior effectiveness studies, the exact rendering parameters employed, and the procedure used to align MS-SSIM scores with the empirical rankings. revision: yes
-
Referee: [Discussion / Conclusion] Generalization paragraph (end of abstract and discussion): the assertion that the approach 'can be used to rank competing encodings' for 'a particular type of data distribution' assumes transfer beyond the two tested regimes (scatterplots; basic chart types). No cross-validation on new chart families, data distribution families, or rendering parameters is reported, making the broader utility claim unsupported by the presented evidence.
Authors: The current claims are grounded only in the two reported studies. We will revise the abstract and discussion to qualify the generalization statements, explicitly note the absence of cross-validation on additional chart families or distributions, and add a limitations paragraph discussing the scope of the evidence. revision: partial
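The correlation statistics requested in the first major comment could be reported with something like the sketch below. The function names are hypothetical and no data from the paper is used; a reader would pass one MS-SSIM score and one mean human similarity rating per image pair. Spearman's coefficient is included as an assumption about what would be useful, since human ratings are often ordinal.

```python
from math import sqrt
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))
```

Reporting both coefficients, together with the number of image pairs and participants, would let readers judge whether the metric tracks the ordering of human judgments, their magnitudes, or both.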
Circularity Check
No circularity: MS-SSIM applied as fixed external metric and validated on independent human data
The paper applies the pre-existing MS-SSIM image metric (not fitted or redefined here) to rendered chart images and directly compares the resulting scores against separate human similarity judgments (Study 1) and prior empirical effectiveness measurements (Study 2). The central claim—that MS-SSIM approximates discriminability—rests on these external benchmarks rather than any equation that reduces the output to a parameter defined from the same inputs, any self-citation load-bearing the uniqueness of the method, or an ansatz smuggled from prior author work. No derivation step equates a prediction to its own fitting data by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: MS-SSIM scores on rendered plots will track human discriminability judgments for the tested visualization types and data distributions.
Reference graph
Works this paper leans on
- [1] M. J. Alam, S. G. Kobourov, and S. Veeramoni. Quantitative measures for cartogram generation techniques. Computer Graphics Forum, 34(3):351–360, 2015.
- [2] L. Bartram and M. C. Stone. Whisper, don’t scream: Grids and transparency. IEEE Trans. Visualization and Computer Graphics, 17(10):1444–1458, 2011.
- [3] M. Behrisch, M. Blumenschein, N. W. Kim, L. Shao, M. El-Assady, J. Fuchs, D. Seebacher, A. Diehl, U. Brandes, H. Pfister, et al. Quality metrics for information visualization. Computer Graphics Forum, 37(3):625–662, 2018.
- [4] E. Bertini and G. Santucci. Quality metrics for 2D scatterplot graphics: Automatically reducing visual clutter. In Int. Symp. on Smart Graphics, pp. 77–89. Springer, 2004.
- [5] R. Brath. Metrics for effective information visualization. In Proc. of IEEE Symp. on Information Visualization, pp. 108–111. IEEE, 1997.
- [6] A. Brychtová and A. Çöltekin. The effect of spatial distance on the discriminability of colors in maps. Cartography and Geographic Information Science, 44(3):229–245, 2017.
- [7] M. Chen, D. Ebert, H. Hagen, R. S. Laramee, R. van Liere, K.-L. Ma, W. Ribarsky, G. Scheuermann, and D. Silver. Data, information, and knowledge in visualization. IEEE Computer Graphics and Applications, 29(1):12–19, 2009.
- [8] M. Chen and H. Jänicke. An information-theoretic framework for visualization. IEEE Trans. Visualization and Computer Graphics, 16(6):1206–15, 2010.
- [9] C. Demiralp, M. S. Bernstein, and J. Heer. Learning perceptual kernels for visualization design. IEEE Trans. Visualization and Computer Graphics, 20(12):1933–1943, 2014.
- [10] C. Demiralp, C. E. Scheidegger, G. L. Kindlmann, D. H. Laidlaw, and J. Heer. Visual embedding: A model for visualization. IEEE Computer Graphics and Applications, 34(1):10–15, 2014.
- [11]
- [12] G. Ellis and A. Dix. The plot, the clutter, the sampling and its lens: occlusion measures for automatic clutter reduction. In Proc. of the Working Conf. on Advanced Visual Interfaces, pp. 266–269. ACM, 2006.
- [13] C. C. Gramazio, D. H. Laidlaw, and K. B. Schloss. Colorgorical: Creating discriminable and preferable color palettes for information visualization. IEEE Trans. Visualization and Computer Graphics, 23(1):521–530, 2017.
- [14] S. Haroz and D. Whitney. How Capacity Limits of Attention Influence Information Visualization Effectiveness. IEEE Trans. Visualization and Computer Graphics, 18(12):2402–2410, Dec. 2012.
- [15] J. Harper and M. Agrawala. Deconstructing and restyling D3 visualizations. In Proc. of the ACM Symp. on User Interface Software and Technology, pp. 253–262. ACM, 2014.
- [16] J. Heer and M. Bostock. Crowdsourcing graphical perception: Using mechanical turk to assess visualization design. In Proc. of the SIGCHI Conf. on Human Factors in Computing Systems, pp. 203–212. ACM, 2010.
- [17] H. Hofmann, L. Follett, M. Majumder, and D. Cook. Graphical tests for power comparison of competing designs. IEEE Trans. Visualization and Computer Graphics, 18(12):2441–2448, 2012.
- [18]
- [19] Interactive Data Lab. Vega Lite example gallery, 2018. https://vega.github.io/vega-lite/examples/
- [20] H. Jänicke and M. Chen. A salience-based quality metric for visualization. Computer Graphics Forum, 29(3):1183–1192, 2010.
- [21] S. Johansson and J. Johansson. Interactive dimensionality reduction through user-defined combinations of quality metrics. IEEE Trans. Visualization and Computer Graphics, 15(6):993–1000, 2009.
- [22]
- [23] G. Kindlmann and C. Scheidegger. An algebraic process for visualization design. IEEE Trans. Visualization and Computer Graphics, 20(12):2181–2190, Dec. 2014.
- [24] S. Lin, J. Fortuna, C. Kulkarni, M. Stone, and J. Heer. Selecting semantically-resonant colors for data visualization. Computer Graphics Forum, 32(3pt4):401–410, 2013.
- [25]
- [26] J. Matejka and G. Fitzmaurice. Same stats, different graphs: generating datasets with varied appearance and identical statistics through simulated annealing. In Proc. of the SIGCHI Conf. on Human Factors in Computing Systems, pp. 1290–1294. ACM, 2017.
- [27] G. G. Méndez, M. A. Nacenta, and S. Vandenheste. iVoLVER: Interactive visual language for visualization extraction and reconstruction. In Proc. of the SIGCHI Conf. on Human Factors in Computing Systems, pp. 4073–
- [28] M. J. Menne, I. Durre, R. S. Vose, B. E. Gleason, and T. G. Houston. An overview of the global historical climatology network-daily database. Journal of Atmospheric and Oceanic Technology, 29(7):897–910, 2012.
- [29] A. K. Moorthy and A. C. Bovik. Visual importance pooling for image quality assessment. IEEE Journal of Selected Topics in Signal Processing, 3(2):193–201, 2009.
- [30] T. Munzner. A nested model for visualization design and validation. IEEE Trans. Visualization and Computer Graphics, 15(6):921–928, Nov. 2009.
- [31] T. Munzner. Visualization Analysis & Design. CRC Press, 2014.
- [32] A. Ninassi, O. Le Meur, P. Le Callet, and D. Barba. Does where you gaze on an image affect your perception of quality? Applying visual attention to image quality metric. In Proc. IEEE Int. Conf. on Image Processing, vol. 2, pp. II/169–II/172. IEEE, 2007.
- [33] A. V. Pandey, J. Krause, C. Felix, J. Boy, and E. Bertini. Towards understanding human similarity perception in the analysis of large sets of scatter plots. In Proc. of the SIGCHI Conf. on Human Factors in Computing Systems, pp. 3659–3669. ACM, 2016.
- [34] R. A. Rensink. On the prospects for a science of visualization. In Handbook of Human Centric Visualization, pp. 147–175. Springer, 2014.
- [35] G. Ryan, A. Mosca, R. Chang, and E. Wu. At a glance: Pixel approximate entropy as a measure of line chart complexity. IEEE Trans. Visualization and Computer Graphics, 25(1):872–881, 2019.
- [36] D. A. Szafir. Modeling color difference for visualization design. IEEE Trans. Visualization and Computer Graphics, 24(1):392–401, 2018.
- [37] A. Tatu, G. Albuquerque, M. Eisemann, J. Schneidewind, H. Theisel, M. Magnork, and D. Keim. Combining automated analysis and visualization techniques for effective exploration of high-dimensional data. In Proc. IEEE Symp. Visual Analytics Science and Technology, pp. 59–66, 2009.
- [38]
- [39] E. R. Tufte. The Visual Display of Quantitative Information. Graphics Press, 2nd ed., 2001.
- [40] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Processing, 13(4):600–612, 2004.
- [41] Z. Wang and Q. Li. Information content weighting for perceptual image quality assessment. IEEE Trans. Image Processing, 20(5):1185–1198, 2011.
- [42] Z. Wang, E. P. Simoncelli, and A. C. Bovik. Multiscale structural similarity for image quality assessment. In Proc. of the Asilomar Conf. on Signals, Systems Computers, vol. 2, pp. 1398–1402 Vol.2, Nov 2003.
- [43] M. Wattenberg and D. Fisher. Analyzing perceptual organization in information graphics. Information Visualization, 3(2):123–133, 2004.
- [44] H. Wickham, D. Cook, H. Hofmann, and A. Buja. Graphical inference for Infovis. IEEE Trans. Visualization and Computer Graphics, 16(6):973–9, Jan. 2010.
- [45] L. Wilkinson, A. Anand, and R. Grossman. Graph-theoretic scagnostics. In IEEE Symp. on Information Visualization, 2005. INFOVIS 2005, pp. 157–164. IEEE, 2005.
- [46] F. Yang, L. T. Harrison, R. A. Rensink, S. L. Franconeri, and R. Chang. Correlation judgment and visualization features: A comparative study. IEEE Trans. Visualization and Computer Graphics, 25(3):1474–1488, 2019.
- [47] V. Yoghourdjian, T. Dwyer, K. Klein, K. Marriott, and M. Wybrow. Graph thumbnails: Identifying and comparing multiple graphs at a glance. IEEE Trans. Visualization and Computer Graphics, 24(12):3081–3095, 2018.
- [48]