Guiding the exploration of scatter plot data using motif-based interest measures

https://doi.org/10.1016/j.jvlc.2016.07.003

Abstract

Finding interesting patterns in large scatter plot spaces is a challenging problem that becomes even more difficult as the number of dimensions increases. Previous approaches for exploring large scatter plot spaces, such as the well-known Scagnostics approach, mainly focus on ranking scatter plots based on their global properties. However, local patterns often contribute significantly to the interestingness of a scatter plot. We propose a novel approach for automatically determining interesting views in scatter plot spaces based on the analysis of local scatter plot segments. Specifically, we automatically group similar local scatter plot segments into classes, which we call scatter plot motifs. Inspired by the well-known tf×idf approach from information retrieval, we compute local and global quality measures based on frequency properties of the local motifs. We show how these measures can be used to filter, rank and compare scatter plots and the motifs they contain. We demonstrate the usefulness of our approach with synthetic and real-world data sets and showcase our data exploration tools, which visualize the distribution of local scatter plot motifs in relation to a large overall scatter plot space.

Introduction

Vast amounts of data are now created rapidly in many application domains, which raises the problem of effective and efficient access to large multivariate and high-dimensional data. Whereas storage capacity was the primary problem in the past, today's challenges include tasks such as detecting interesting patterns or correlations in large data sets. One solution is to apply suitable visualization techniques and search for hidden information within the data. Scatter plot visualizations are among the most widely used and best-understood visual representations for bivariate data. They can also be applied to high-dimensional data via dimensionality reduction or the scatter plot matrix representation [1]. However, perceiving and finding interesting scatter plots in large scatter plot collections is a severe challenge, especially when working with scatter plot matrices.

Manually searching through large numbers of data views is exhausting and may become infeasible for high-dimensional data sets. Recent work in Visual Analytics has focused on computing interestingness measures, which can be used to filter and rank large data spaces and present the user with a good starting point for exploration. Specifically, several previous approaches, such as [2], [3], [4], have focused on interestingness measures based on global properties of scatter plots for ranking and filtering. However, global interestingness scores do not consider the impact of local patterns, which add to the overall interestingness of a scatter plot. Often, it is the combination of several different local scatter plot patterns that, by their composition, constitutes an interesting data view.

Here, we present a novel approach to discovering interesting scatter plot views which, as opposed to current quality metrics, focuses on scatter plot interestingness derived from local data properties. We adapt a minimum spanning tree-based clustering technique for a non-parametric segmentation of scatter plots as a data preprocessing step. Next, we apply ideas from the image analysis domain to scatter plots. Specifically, we extract visual features as the basis for clustering local scatter plot segments into groups of similar patterns, called motifs. Consequently, we are able to compute an interestingness measure for scatter plots in terms of the distribution of occurring motifs. Our idea here is that visually discriminating motifs are of interest, since a human viewer can recognize them quickly. We apply a Bag-of-Visual-Words [5] concept to scatter plots and transfer the idea of tf×idf-weighting to this domain. Thus, we can derive the interestingness of a local scatter plot motif from its occurrence within a given scatter plot in relation to its occurrence in the other scatter plots of a scatter plot space. We use these local motif-based measurements to rank and filter large scatter plot spaces.
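
As a rough illustration of the tf×idf idea described above, the following Python sketch weights each motif by how often it occurs within one scatter plot and how rarely it occurs across the whole scatter plot space. The function name `motif_tfidf`, the plot identifiers, the motif labels, and the specific tf and idf variants (relative frequency and a plain logarithm) are assumptions made for illustration; the weighting actually used in the paper may differ.

```python
import math
from collections import Counter


def motif_tfidf(plots):
    """Return tf-idf weights per plot.

    plots: dict mapping a plot id to the list of motif labels assigned to
    its local segments (hypothetical input format).
    """
    n_plots = len(plots)
    # Document frequency: number of plots in which a motif occurs at least once.
    df = Counter()
    for labels in plots.values():
        df.update(set(labels))

    weights = {}
    for plot_id, labels in plots.items():
        counts = Counter(labels)
        total = len(labels)
        # tf = relative frequency within the plot (assumed variant),
        # idf = log(N / df) (assumed variant).
        weights[plot_id] = {
            motif: (count / total) * math.log(n_plots / df[motif])
            for motif, count in counts.items()
        }
    return weights


# Toy scatter plot space with made-up motif labels.
plots = {
    "x1_vs_x2": ["blob", "blob", "line"],
    "x1_vs_x3": ["blob", "blob", "blob"],
    "x2_vs_x3": ["blob", "ring"],
}
w = motif_tfidf(plots)
```

In this toy input the "blob" motif occurs in every plot and therefore receives weight zero, while the rarer "line" and "ring" motifs receive positive weights in the plots that contain them.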

We claim the following technical contributions:

  • We adapt the minimum spanning tree-based clustering technique for a non-parametric segmentation of scatter plot diagrams.

  • We introduce a motif-based dictionary to assess the interestingness of local scatter plot patterns.

  • We define a global interestingness score based on the occurrence and similarity of local motifs.
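
The segmentation named in the first contribution can be sketched in Python as follows: build a minimum spanning tree over the points of one scatter plot, remove unusually long edges, and treat the remaining connected components as local segments. The function names, the `factor` parameter, and the cut rule (drop edges longer than `factor` times the mean edge length) are assumptions made for illustration only. The paper describes its segmentation as non-parametric, so its actual cut criterion is likely different.

```python
import math


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (length, i, j)."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known connection of each point to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def segment(points, factor=2.0):
    """Cut long MST edges (assumed rule) and label the resulting components."""
    edges = mst_edges(points)
    if not edges:
        return [0] * len(points)
    cutoff = factor * sum(e[0] for e in edges) / len(edges)
    root = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while root[i] != i:
            root[i] = root[root[i]]
            i = root[i]
        return i

    for length, i, j in edges:
        if length <= cutoff:
            root[find(i)] = find(j)
    # Relabel components as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(points))]


# Two well-separated groups of four points each (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = segment(points)
```

For the toy data, the single long edge bridging the two groups exceeds the cutoff and is removed, so the points fall into two segments.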

This work is a revised version of an earlier conference article [6]. In this paper, we contribute several extensions to our initial approach. First, we extended the related work with a discussion of recent papers on visual abstractions of scatter plot spaces and on the generation of projection views from high-dimensional input data. Furthermore, we introduce technical extensions that improve the exploration of local scatter plot patterns. Specifically, the distribution of local patterns from the local pattern dictionary can be visually explored in a user-configurable Star Coordinate view that includes detail-on-demand. We also introduce a hybrid design for embedding scatter plot motif views within a Parallel Coordinate display, which allows local scatter plot patterns to be related to further data dimensions. These extensions provide additional contributions and improve the usefulness and analytical power of the proposed approach.

The remainder of this paper is structured as follows. In Section 2, we discuss related work, showing commonalities and highlighting differences. Section 3 gives an overview of our general idea of using local motif analysis to compute local and global interestingness measures. In Section 4, we present technical details of our approach. Section 5 gives an overview of our visual exploration tools for identifying and analyzing local motifs. Next, in Section 6, we apply our implementation to different data sets and showcase a local motif-driven exploration. Our approach is only a first step toward scatter plot analysis based on local patterns, and we discuss limitations and a range of possible extensions in Section 7. Finally, Section 8 concludes the paper.

Section snippets

Related work

Several works support the exploration of large scatter plot data sets by means of ranking, filtering and searching functionalities. We next review a selection of works in the context of our approach.

Overview of our approach

The main goal of our approach is to guide the analyst through the exploration process, when facing a data set with a large number of individual scatter plots. Our main idea is to compute a dictionary of local scatter plot segments from the set of all scatter plots. The dictionary will contain prototype scatter plot segments (called motifs) that represent the different local scatter plot shapes occurring in a given data set. We form this dictionary by first partitioning all scatter plots into a

Global interest-measure based on local motifs

This section provides a technical overview of our implementation for detecting interesting scatter plot motifs and presents our aggregation scheme from local to global interestingness scores.
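
To make the step from local to global scores concrete, here is a minimal Python sketch that aggregates per-motif weights into one score per scatter plot and ranks the plots. The aggregation by plain summation, the function name `rank_plots`, the optional `min_score` filter, and the numeric weights are all assumptions chosen for illustration; the aggregation scheme of the paper is not reproduced here.

```python
def rank_plots(motif_weights, min_score=0.0):
    """Rank plots by the sum of their local motif weights, highest first.

    motif_weights: dict mapping a plot id to a dict of motif -> weight
    (hypothetical input format). Plots scoring below min_score are dropped.
    """
    scores = {plot: sum(w.values()) for plot, w in motif_weights.items()}
    kept = [(plot, score) for plot, score in scores.items() if score >= min_score]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Made-up local weights for three scatter plots.
motif_weights = {
    "x1_vs_x2": {"blob": 0.0, "line": 0.37},
    "x1_vs_x3": {"blob": 0.0},
    "x2_vs_x3": {"blob": 0.0, "ring": 0.55},
}
ranking = rank_plots(motif_weights)
filtered = rank_plots(motif_weights, min_score=0.1)
```

Summation favors plots that contain several distinctive motifs; a maximum or a mean would be equally plausible choices and would rank plots with one dominant motif differently.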

Visual exploration

In this section, we introduce our visual exploration approaches and demonstrate how they support the selection of an appropriate dictionary size.

Application of motif-based dictionary

We now demonstrate the usefulness of our interest measure and the global scatter plot ranking by means of our visual exploration tools. First, we use a synthetic data set as a proof-of-concept to showcase our proposed interest measure. We then make use of the interest measure on a real-world data set and explore the suggested scatter plots.

Discussion of limitations and extensions

In our approach, we aim to guide the exploration of scatter plots based on the notion of interestingness of local motifs in large scatter plot spaces. The concept of local pattern analysis is novel in that it extends beyond most feature-based scatter plot analysis methods, which consider global scatter plot features. Our solution is a first step toward extending the analysis to local scatter plot patterns, and it depends on the choice of methods applied. Our prototype is implemented in Java

Conclusion

We introduced a novel workflow in which we analyze the interestingness of automatically extracted local motifs to guide the exploration of scatter plot data. To assess the overall interestingness, we adapted the tf×idf scheme from information retrieval to the domain of scatter plot motifs. We derive the interestingness of local scatter plot motifs from their occurrence within individual scatter plots and across the scatter plot space. Furthermore, we developed interactive visual exploration tools with brushing and

Acknowledgment

This work was partially supported by the State of Baden-Württemberg within the research project Visual Search and Analysis Methods for Time-Oriented Annotated Data, With Applications to Research and Open Data.

References (42)

  • R. Motta et al., Graph-based measures to assist user assessment of multidimensional projections, Neurocomputing (2015)
  • P. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, J. Comput. Appl. Math. (1987)
  • M. Ward et al., Interactive Data Visualization: Foundations, Techniques, and Applications (2010)
  • L. Wilkinson, A. Anand, R. Grossman, Graph-theoretic scagnostics, in: Proceedings of the IEEE Symposium on Information...
  • M. Sips, B. Neubert, J.P. Lewis, P. Hanrahan, Selecting good views of high-dimensional data using class consistency,...
  • A. Tatu et al., Automated analytical methods to support visual exploration of high-dimensional data, IEEE Trans. Vis. Comput. Graph. (2011)
  • J. Yang, Y.-G. Jiang, A.G. Hauptmann, C.-W. Ngo, Evaluating bag-of-visual-words representations in scene...
  • L. Shao, T. Schleicher, M. Behrisch, T. Schreck, I. Sipiran, D.A. Keim, Guiding the exploration of scatter plot data...
  • W. Cleveland, The shape parameter of a two-variable graph, J. Am. Stat. Assoc. (1988)
  • J. Talbot et al., Arc length-based aspect ratio selection, IEEE Trans. Vis. Comput. Graph. (2011)
  • M. Fink et al., Selecting the aspect ratio of a scatter plot based on its Delaunay triangulation, IEEE Trans. Vis. Comput. Graph. (2013)
  • A. Mayorga et al., Splatterplots: overcoming overdraw in scatter plots, IEEE Trans. Vis. Comput. Graph. (2013)
  • H. Chen, W. Chen, H. Mei, Z. Liu, K. Zhou, W. Chen, W. Gu, K.-L. Ma, Visual abstraction and exploration of multi-class...
  • M. Sedlmair et al., A taxonomy of visual cluster separation factors, Comput. Graph. Forum (Proc. EuroVis 2012) (2012)
  • M. Sips, B. Neubert, J.P. Lewis, P. Hanrahan, Selecting good views of high-dimensional data using class consistency,...
  • D.J. Lehmann et al., Selecting coherent and relevant plots in large scatterplot matrices, Comput. Graph. Forum (2012)
  • S. Bremm et al., Assisted descriptor selection based on visual comparative data analysis, Wiley-Blackwell Comput. Graph. Forum (2011)
  • G. Albuquerque, M. Eisemann, D.J. Lehmann, H. Theisel, M.A. Magnor, Quality-based visualization matrices, in: VMV,...
  • A. Anand, L. Wilkinson, D. T. Nhon, Visual pattern discovery using random projections, in: IEEE VAST, 2012, pp....
  • A. Tatu, F. Maaß, I. Färber, E. Bertini, T. Schreck, T. Seidl, D.A. Keim, Subspace search and visualization to make...
  • N. Elmqvist et al., Rolling the dice: multidimensional visual exploration using scatterplot matrix navigation, IEEE Trans. Vis. Comput. Graph. (Proc. InfoVis 2008) (2008)