
arxiv: 1907.11358 · v1 · pith:WVEBSBDMnew · submitted 2019-07-26 · 💻 cs.HC

Discriminability Tests for Visualization Effectiveness and Scalability

Pith reviewed 2026-05-24 15:48 UTC · model grok-4.3

classification 💻 cs.HC
keywords visualization · discriminability · MS-SSIM · image similarity · scalability · effectiveness · human perception · chart design

The pith

MS-SSIM image similarity scores can approximate human judgments of how discriminable visualizations are across different datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a computational image similarity measure, MS-SSIM, can stand in for human perception when checking if visualizations change noticeably when the underlying data changes. It runs two studies: one matching MS-SSIM scores against people's similarity ratings on scatterplots, and another comparing discriminability scores to measured effectiveness of basic chart types. The results indicate the measure tracks empirical findings closely enough to rank different visual encodings by how well they reveal data differences. This matters because designers often pick visualizations without checking their behavior over many possible datasets, and a fast computational test could help avoid poor matches between encoding and data distribution.

Core claim

The central claim is that the Multi-Scale Structural Similarity Index applied to rendered visualization images captures both low-level and high-level differences, and its scores align with human similarity judgments and empirical effectiveness measures in the tested cases, thereby providing a way to evaluate and select visualizations based on their discriminability.

What carries the argument

The Multi-Scale Structural Similarity Index (MS-SSIM) applied to pairs of rendered visualization images to quantify how much the visual output changes with data changes.
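
For reference, the standard formulation of the index from Wang et al. [42] is sketched below. The symbols are the conventional ones from that work and do not appear on this page: $\mu$ and $\sigma$ are local means and standard deviations, $\sigma_{xy}$ is the covariance, $C_1$, $C_2$, $C_3$ are small stabilizing constants, $M$ is the number of scales, and $\alpha_M$, $\beta_j$, $\gamma_j$ are scale weights. Which weights and constants the paper actually used is not stated here, so this should be read as the generic definition, not as the authors' exact configuration.

```latex
\mathrm{MS\text{-}SSIM}(x, y) =
  \left[ l_M(x, y) \right]^{\alpha_M}
  \prod_{j=1}^{M} \left[ c_j(x, y) \right]^{\beta_j} \left[ s_j(x, y) \right]^{\gamma_j}

l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \qquad
c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \qquad
s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}
```

Contrast and structure are compared at every scale $j$, while luminance enters only at the coarsest scale $M$, which is why the index can respond to both fine-grained and large-scale changes in a rendered chart.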

Load-bearing premise

That MS-SSIM scores on rendered visualization images will continue to track human discriminability judgments when applied to new chart types, new data distributions, or different rendering parameters beyond the two studies described.

What would settle it

A new study on a different visualization type, such as parallel coordinates or treemaps, where human similarity judgments diverge substantially from MS-SSIM scores on the same image pairs.
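
One way such a divergence could be quantified is a rank correlation between human similarity ratings and metric scores on the same image pairs. The sketch below is only an illustration: the helper names (`ranks`, `pearson`, `spearman`) and all numbers are hypothetical and are not taken from the paper, and the paper may have used a different statistic.

```python
# Illustrative sketch: Spearman rank correlation in pure Python.
# All data below are made-up placeholders, NOT results from the paper.

def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical human similarity ratings for five image pairs.
human = [0.9, 0.7, 0.4, 0.2, 0.1]
# Hypothetical metric scores that order the pairs the same way...
metric_agree = [0.95, 0.80, 0.55, 0.30, 0.20]
# ...and scores that order them in exactly the opposite way.
metric_diverge = [0.20, 0.30, 0.55, 0.80, 0.95]

rho_agree = spearman(human, metric_agree)
rho_diverge = spearman(human, metric_diverge)
print(rho_agree, rho_diverge)
```

A rho near +1 on a new chart type would be consistent with the premise; a rho near zero or negative would be the kind of divergence described above.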

Figures

Figures reproduced from arXiv: 1907.11358 by Christopher Collins, Rafael Veras.

Figure 1: Data and image similarity measures: Mean-Squared Error (MSE), Structural Similarity Index (SSIM), and Multi-Scale SSIM (MS-SSIM).
Figure 2: The effect of grids on SSIM for scatterplots of the Iris dataset.
Figure 3: SSIM applied on YUV image representations. Each row shows …
Figure 4: Empirical and MS-SSIM clusterings of the scatterplots from the study of Pandey et al. [33]. MS-SSIM parameters were tuned to the …
Figure 6: Images generated for the global discriminability test. Left: …
Figure 7: Pairs of colored scatterplots (y x color) with y values swapped between two categories. a) and b) have 3 categories in total, while c) and d) have 30 categories. These pairs (a,b) and (c,d) are used to measure the visual discriminability of two categories (other categories fixed) along one variable
Figure 8: Global and Local discriminability scores computed with MS-SSIM ( …
Figure 9: Discriminability rankings of encodings (divided by data property) derived from discriminability scores.
Original abstract

The scalability of a particular visualization approach is limited by the ability for people to discern differences between plots made with different datasets. Ideally, when the data changes, the visualization changes in perceptible ways. This relation breaks down when there is a mismatch between the encoding and the character of the dataset being viewed. Unfortunately, visualizations are often designed and evaluated without fully exploring how they will respond to a wide variety of datasets. We explore the use of an image similarity measure, the Multi-Scale Structural Similarity Index (MS-SSIM), for testing the discriminability of a data visualization across a variety of datasets. MS-SSIM is able to capture the similarity of two visualizations across multiple scales, including low level granular changes and high level patterns. Significant data changes that are not captured by the MS-SSIM indicate visualizations of low discriminability and effectiveness. The measure's utility is demonstrated with two empirical studies. In the first, we compare human similarity judgments and MS-SSIM scores for a collection of scatterplots. In the second, we compute the discriminability values for a set of basic visualizations and compare them with empirical measurements of effectiveness. In both cases, the analyses show that the computational measure is able to approximate empirical results. Our approach can be used to rank competing encodings on their discriminability and to aid in selecting visualizations for a particular type of data distribution.
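
The multi-scale comparison described in the abstract can be illustrated with a toy sketch. It is deliberately simplified and is not the authors' implementation: it uses whole-image statistics instead of a sliding window, equal scale weights, 2x2 average pooling, and a hypothetical `rasterize` helper that draws points on a small grayscale grid. All function names here are assumptions made for illustration.

```python
# Toy sketch: SSIM-style terms pooled over several scales.
# Assumptions (not from the paper): global statistics instead of local
# windows, equal scale weights, grayscale pixel values in [0, 255].

C1 = (0.01 * 255) ** 2  # stabilizer for the luminance term
C2 = (0.03 * 255) ** 2  # stabilizer for the contrast-structure term


def rasterize(points, size=16):
    """Draw (x, y) points with coordinates in [0, 1) as dark pixels on white."""
    img = [[255.0] * size for _ in range(size)]
    for x, y in points:
        img[int(y * size)][int(x * size)] = 0.0
    return img


def ssim_terms(a, b):
    """Return (luminance, contrast_structure) from whole-image statistics."""
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    vx = sum(d * d for d in dx) / n
    vy = sum(d * d for d in dy) / n
    cov = sum(p * q for p, q in zip(dx, dy)) / n
    lum = (2 * mx * my + C1) / (mx * mx + my * my + C1)
    cs = (2 * cov + C2) / (vx + vy + C2)
    return lum, cs


def downsample(img):
    """Halve the resolution by averaging non-overlapping 2x2 blocks."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * i][2 * j] + img[2 * i][2 * j + 1]
              + img[2 * i + 1][2 * j] + img[2 * i + 1][2 * j + 1]) / 4
             for j in range(w)] for i in range(h)]


def toy_ms_ssim(a, b, scales=3):
    """Multiply contrast-structure terms over scales; luminance at the coarsest."""
    weight = 1.0 / scales
    score = 1.0
    for level in range(scales):
        lum, cs = ssim_terms(a, b)
        score *= max(cs, 0.0) ** weight  # clamp: a negative term has no real root
        if level == scales - 1:
            score *= lum ** weight
        else:
            a, b = downsample(a), downsample(b)
    return score


# Two made-up datasets: an upward trend and a downward trend.
rising = [(0.1, 0.1), (0.3, 0.35), (0.5, 0.5), (0.7, 0.65), (0.9, 0.9)]
falling = [(0.1, 0.9), (0.3, 0.65), (0.5, 0.5), (0.7, 0.35), (0.9, 0.1)]

img_a, img_b = rasterize(rising), rasterize(falling)
same = toy_ms_ssim(img_a, img_a)  # an image compared with itself
diff = toy_ms_ssim(img_a, img_b)  # images of two different datasets
print(same, diff)
```

In this framing, a visualization discriminates well when a meaningful change in the data drives the similarity score clearly below the score of an image compared with itself. A production analysis would more plausibly rely on a tested library implementation of MS-SSIM than on a hand-rolled one like this.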

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that the Multi-Scale Structural Similarity Index (MS-SSIM) serves as a computational proxy for human discriminability in visualizations, capturing both low-level and high-level changes across datasets. This is supported by two studies: (1) comparison of MS-SSIM scores against human similarity judgments on scatterplots, and (2) computation of discriminability values for basic visualizations compared against prior empirical effectiveness measures. The authors conclude that the measure approximates empirical results and can rank encodings or guide visualization selection for given data distributions.

Significance. If the central approximation claim holds under broader conditions, the work would provide a scalable, automated method for evaluating visualization discriminability without repeated user studies, which is a practical strength for the visualization community. The use of an established image metric (MS-SSIM) and the explicit comparison to human data are positive features. However, the narrow scope of validation limits the immediate significance.

major comments (3)
  1. [Abstract / Study 1] Abstract and Study 1 description: the claim that MS-SSIM 'approximates empirical results' is central but rests on an unspecified collection of scatterplots, human judgment protocol, number of participants, and correlation statistics (e.g., Pearson r, p-values, effect sizes). Without these, the strength of the approximation cannot be assessed.
  2. [Study 2] Study 2: the comparison of computed discriminability values to 'empirical measurements of effectiveness' is load-bearing for the ranking claim, yet the manuscript provides no information on which prior effectiveness studies were used, how visualizations were rendered (point size, color maps, aspect ratio), or the exact matching procedure between MS-SSIM scores and effectiveness rankings.
  3. [Discussion / Conclusion] Generalization paragraph (end of abstract and discussion): the assertion that the approach 'can be used to rank competing encodings' for 'a particular type of data distribution' assumes transfer beyond the two tested regimes (scatterplots; basic chart types). No cross-validation on new chart families, data distribution families, or rendering parameters is reported, making the broader utility claim unsupported by the presented evidence.
minor comments (2)
  1. [Introduction] Notation: MS-SSIM is introduced without an equation or reference to the original Wang et al. formulation; a brief definition or citation would improve clarity.
  2. [Figures] Figure captions: the scatterplot examples and effectiveness comparison plots lack axis labels or legends indicating the exact data distributions or encoding parameters used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. The feedback identifies important areas where additional detail and qualification are needed to strengthen the manuscript. We address each major comment below and will revise the paper accordingly.

point-by-point responses
  1. Referee: [Abstract / Study 1] Abstract and Study 1 description: the claim that MS-SSIM 'approximates empirical results' is central but rests on an unspecified collection of scatterplots, human judgment protocol, number of participants, and correlation statistics (e.g., Pearson r, p-values, effect sizes). Without these, the strength of the approximation cannot be assessed.

    Authors: We agree that the abstract is too concise on these points and that the manuscript should make the supporting details more accessible. In the revision we will expand the abstract to summarize the scatterplot collection size, human judgment protocol, participant count, and key correlation statistics, and we will add explicit cross-references to the full methodological description in the Study 1 section. revision: yes

  2. Referee: [Study 2] Study 2: the comparison of computed discriminability values to 'empirical measurements of effectiveness' is load-bearing for the ranking claim, yet the manuscript provides no information on which prior effectiveness studies were used, how visualizations were rendered (point size, color maps, aspect ratio), or the exact matching procedure between MS-SSIM scores and effectiveness rankings.

    Authors: We acknowledge that these implementation details are currently underspecified. The revised manuscript will include a dedicated subsection describing the referenced prior effectiveness studies, the exact rendering parameters employed, and the procedure used to align MS-SSIM scores with the empirical rankings. revision: yes

  3. Referee: [Discussion / Conclusion] Generalization paragraph (end of abstract and discussion): the assertion that the approach 'can be used to rank competing encodings' for 'a particular type of data distribution' assumes transfer beyond the two tested regimes (scatterplots; basic chart types). No cross-validation on new chart families, data distribution families, or rendering parameters is reported, making the broader utility claim unsupported by the presented evidence.

    Authors: The current claims are grounded only in the two reported studies. We will revise the abstract and discussion to qualify the generalization statements, explicitly note the absence of cross-validation on additional chart families or distributions, and add a limitations paragraph discussing the scope of the evidence. revision: partial

Circularity Check

0 steps flagged

No circularity: MS-SSIM applied as fixed external metric and validated on independent human data

full rationale

The paper applies the pre-existing MS-SSIM image metric (not fitted or redefined here) to rendered chart images and directly compares the resulting scores against separate human similarity judgments (Study 1) and prior empirical effectiveness measurements (Study 2). The central claim, that MS-SSIM approximates discriminability, rests on these external benchmarks. It does not rest on an equation that reduces the output to a parameter defined from the same inputs, on a self-citation that carries the uniqueness of the method, or on an ansatz smuggled in from the authors' prior work. No derivation step equates a prediction to its own fitting data by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the untested transfer of an image-quality metric to visualization perception and on the assumption that the chosen datasets and chart renderings are representative; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption MS-SSIM scores on rendered plots will track human discriminability judgments for the tested visualization types and data distributions.
    Invoked when the authors conclude the measure approximates empirical results and can be used to rank encodings.


Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages

  1. [1] M. J. Alam, S. G. Kobourov, and S. Veeramoni. Quantitative measures for cartogram generation techniques. Computer Graphics Forum, 34(3):351–360, 2015
  2. [2] L. Bartram and M. C. Stone. Whisper, don’t scream: Grids and transparency. IEEE Trans. Visualization and Computer Graphics, 17(10):1444–1458, 2011
  3. [3] M. Behrisch, M. Blumenschein, N. W. Kim, L. Shao, M. El-Assady, J. Fuchs, D. Seebacher, A. Diehl, U. Brandes, H. Pfister, et al. Quality metrics for information visualization. Computer Graphics Forum, 37(3):625–662, 2018
  4. [4] E. Bertini and G. Santucci. Quality metrics for 2D scatterplot graphics: Automatically reducing visual clutter. In Int. Symp. on Smart Graphics, pp. 77–89. Springer, 2004
  5. [5] R. Brath. Metrics for effective information visualization. In Proc. of IEEE Symp. on Information Visualization, pp. 108–111. IEEE, 1997
  6. [6] A. Brychtová and A. Çöltekin. The effect of spatial distance on the discriminability of colors in maps. Cartography and Geographic Information Science, 44(3):229–245, 2017
  7. [7] M. Chen, D. Ebert, H. Hagen, R. S. Laramee, R. van Liere, K.-L. Ma, W. Ribarsky, G. Scheuermann, and D. Silver. Data, information, and knowledge in visualization. IEEE Computer Graphics and Applications, 29(1):12–19, 2009
  8. [8] M. Chen and H. Jänicke. An information-theoretic framework for visualization. IEEE Trans. Visualization and Computer Graphics, 16(6):1206–15, 2010
  9. [9] C. Demiralp, M. S. Bernstein, and J. Heer. Learning perceptual kernels for visualization design. IEEE Trans. Visualization and Computer Graphics, 20(12):1933–1943, 2014
  10. [10] C. Demiralp, C. E. Scheidegger, G. L. Kindlmann, D. H. Laidlaw, and J. Heer. Visual embedding: A model for visualization. IEEE Computer Graphics and Applications, 34(1):10–15, 2014
  11. [11] C. Dunne, S. I. Ross, B. Shneiderman, and M. Martino. Readability metric feedback for aiding node-link visualization designers. IBM Journal of Research and Development, 59(2/3):14–1, 2015
  12. [12] G. Ellis and A. Dix. The plot, the clutter, the sampling and its lens: occlusion measures for automatic clutter reduction. In Proc. of the Working Conf. on Advanced Visual Interfaces, pp. 266–269. ACM, 2006
  13. [13] C. C. Gramazio, D. H. Laidlaw, and K. B. Schloss. Colorgorical: Creating discriminable and preferable color palettes for information visualization. IEEE Trans. Visualization and Computer Graphics, 23(1):521–530, 2017
  14. [14] S. Haroz and D. Whitney. How Capacity Limits of Attention Influence Information Visualization Effectiveness. IEEE Trans. Visualization and Computer Graphics, 18(12):2402–2410, Dec. 2012
  15. [15] J. Harper and M. Agrawala. Deconstructing and restyling D3 visualizations. In Proc. of the ACM Symp. on User Interface Software and Technology, pp. 253–262. ACM, 2014
  16. [16] J. Heer and M. Bostock. Crowdsourcing graphical perception: Using mechanical turk to assess visualization design. In Proc. of the SIGCHI Conf. on Human Factors in Computing Systems, pp. 203–212. ACM, 2010
  17. [17] H. Hofmann, L. Follett, M. Majumder, and D. Cook. Graphical tests for power comparison of competing designs. IEEE Trans. Visualization and Computer Graphics, 18(12):2441–2448, 2012
  18. [18] D. Holten, J. J. V. Wijk, and J.-B. Martens. A perceptually based spectral model for isotropic textures. ACM Trans. Applied Perception (TAP), 3(4):376–398, 2006
  19. [19] Interactive Data Lab. Vega Lite example gallery, 2018. https://vega.github.io/vega-lite/examples/
  20. [20] H. Jänicke and M. Chen. A salience-based quality metric for visualization. Computer Graphics Forum, 29(3):1183–1192, 2010
  21. [21] S. Johansson and J. Johansson. Interactive dimensionality reduction through user-defined combinations of quality metrics. IEEE Trans. Visualization and Computer Graphics, 15(6):993–1000, 2009
  22. [22] Y. Kim and J. Heer. Assessing effects of task and data distribution on the effectiveness of visual encodings. Computer Graphics Forum, 37(3):157–167, 2018
  23. [23] G. Kindlmann and C. Scheidegger. An algebraic process for visualization design. IEEE Trans. Visualization and Computer Graphics, 20(12):2181–2190, Dec. 2014
  24. [24] S. Lin, J. Fortuna, C. Kulkarni, M. Stone, and J. Heer. Selecting semantically-resonant colors for data visualization. Computer Graphics Forum, 32(3pt4):401–410, 2013
  25. [25] Y. Liu and J. Heer. Somewhere over the rainbow: An empirical assessment of quantitative colormaps. In Proc. of the SIGCHI Conf. on Human Factors in Computing Systems. ACM, 2018
  26. [26] J. Matejka and G. Fitzmaurice. Same stats, different graphs: generating datasets with varied appearance and identical statistics through simulated annealing. In Proc. of the SIGCHI Conf. on Human Factors in Computing Systems, pp. 1290–1294. ACM, 2017
  27. [27] G. G. Méndez, M. A. Nacenta, and S. Vandenheste. iVoLVER: Interactive visual language for visualization extraction and reconstruction. In Proc. of the SIGCHI Conf. on Human Factors in Computing Systems, pp. 4073–
  28. [28] M. J. Menne, I. Durre, R. S. Vose, B. E. Gleason, and T. G. Houston. An overview of the global historical climatology network-daily database. Journal of Atmospheric and Oceanic Technology, 29(7):897–910, 2012
  29. [29] A. K. Moorthy and A. C. Bovik. Visual importance pooling for image quality assessment. IEEE Journal of Selected Topics in Signal Processing, 3(2):193–201, 2009
  30. [30] T. Munzner. A nested model for visualization design and validation. IEEE Trans. Visualization and Computer Graphics, 15(6):921–928, Nov. 2009
  31. [31] T. Munzner. Visualization Analysis & Design. CRC Press, 2014
  32. [32] A. Ninassi, O. Le Meur, P. Le Callet, and D. Barba. Does where you gaze on an image affect your perception of quality? Applying visual attention to image quality metric. In Proc. IEEE Int. Conf. on Image Processing, vol. 2, pp. II/169–II/172. IEEE, 2007
  33. [33] A. V. Pandey, J. Krause, C. Felix, J. Boy, and E. Bertini. Towards understanding human similarity perception in the analysis of large sets of scatter plots. In Proc. of the SIGCHI Conf. on Human Factors in Computing Systems, pp. 3659–3669. ACM, 2016
  34. [34] R. A. Rensink. On the prospects for a science of visualization. In Handbook of Human Centric Visualization, pp. 147–175. Springer, 2014
  35. [35] G. Ryan, A. Mosca, R. Chang, and E. Wu. At a glance: Pixel approximate entropy as a measure of line chart complexity. IEEE Trans. Visualization and Computer Graphics, 25(1):872–881, 2019
  36. [36] D. A. Szafir. Modeling color difference for visualization design. IEEE Trans. Visualization and Computer Graphics, 24(1):392–401, 2018
  37. [37] A. Tatu, G. Albuquerque, M. Eisemann, J. Schneidewind, H. Theisel, M. Magnork, and D. Keim. Combining automated analysis and visualization techniques for effective exploration of high-dimensional data. In Proc. IEEE Symp. Visual Analytics Science and Technology, pp. 59–66, 2009
  38. [38] Y. Tu and H. Shen. Visualizing changes of hierarchical data using treemaps. IEEE Trans. Visualization and Computer Graphics, 13(6):1286–1293, Nov. 2007
  39. [39] E. R. Tufte. The Visual Display of Quantitative Information. Graphics Press, 2nd ed., 2001
  40. [40] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Processing, 13(4):600–612, 2004
  41. [41] Z. Wang and Q. Li. Information content weighting for perceptual image quality assessment. IEEE Trans. Image Processing, 20(5):1185–1198, 2011
  42. [42] Z. Wang, E. P. Simoncelli, and A. C. Bovik. Multiscale structural similarity for image quality assessment. In Proc. of the Asilomar Conf. on Signals, Systems Computers, vol. 2, pp. 1398–1402 Vol.2, Nov. 2003
  43. [43] M. Wattenberg and D. Fisher. Analyzing perceptual organization in information graphics. Information Visualization, 3(2):123–133, 2004
  44. [44] H. Wickham, D. Cook, H. Hofmann, and A. Buja. Graphical inference for Infovis. IEEE Trans. Visualization and Computer Graphics, 16(6):973–9, Jan. 2010
  45. [45] L. Wilkinson, A. Anand, and R. Grossman. Graph-theoretic scagnostics. In IEEE Symp. on Information Visualization, 2005. INFOVIS 2005., pp. 157–164. IEEE, 2005
  46. [46] F. Yang, L. T. Harrison, R. A. Rensink, S. L. Franconeri, and R. Chang. Correlation judgment and visualization features: A comparative study. IEEE Trans. Visualization and Computer Graphics, 25(3):1474–1488, 2019
  47. [47] V. Yoghourdjian, T. Dwyer, K. Klein, K. Marriott, and M. Wybrow. Graph thumbnails: Identifying and comparing multiple graphs at a glance. IEEE Trans. Visualization and Computer Graphics, 24(12):3081–3095, 2018
  48. [48] Z. Zheng, H. Cheng, Z. Zhang, Y. Zhao, and P. Wang. An alternative method for understanding user-chosen passwords. Security and Communication Networks, 2018