Multivariate data visualization

Continuous variables are involved in the following:

Parallel coordinate plots
Heatmaps
Star charts

Parallel coordinates

Construction:

Vertical axis: Values
Horizontal axis: Variables
1 trace line = 1 observation

Analysis:
- Clusters
- Outliers
- Correlated variables
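The construction above can be sketched in a few lines: each observation becomes a polyline with one vertex per variable axis. This is only an illustration. The function name `parallel_coordinates_traces` is hypothetical, and the min-max scaling of every variable to [0, 1] is an assumption (the slides only say that the vertical axis carries the values).

```python
def parallel_coordinates_traces(rows):
    """Return one polyline per observation as a list of (axis_index, scaled_value)."""
    n_vars = len(rows[0])
    # Column-wise minimum and maximum, used to put all variables on a common scale.
    lows = [min(r[j] for r in rows) for j in range(n_vars)]
    highs = [max(r[j] for r in rows) for j in range(n_vars)]
    traces = []
    for r in rows:
        trace = []
        for j in range(n_vars):
            span = highs[j] - lows[j]
            # Assumption: min-max scaling; a constant column is drawn at mid-height.
            y = (r[j] - lows[j]) / span if span > 0 else 0.5
            trace.append((j, y))  # x = position of the variable axis, y = scaled value
        traces.append(trace)
    return traces


# Toy data: 3 observations, 3 variables (variable 1 decreases while 0 and 2 increase).
data = [
    [5.0, 3.0, 1.0],
    [6.0, 2.0, 4.0],
    [7.0, 1.0, 7.0],
]
traces = parallel_coordinates_traces(data)
```

In this toy data the segments between axes 0 and 1 cross each other (negative correlation), which is why the order of the axes matters for what the plot reveals.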

Parallel coordinates

Analysis

Positive correlation between two adjacent variables: almost all segments are parallel to each other
Clusters in some variable space: several trace lines that are near each other and have a similar pattern
Outliers: trace lines that have an unusual pattern and/or fall outside the common plot area

Problems

Trace lines overlap each other -> difficult to find patterns, difficult to follow a specific trace line
Analysis depends strongly on the order of variables (correlation, clusters) -> a proper reordering may improve the analysis

Parallel coordinates

Example: Iris dataset - How many clusters do you see?

Parallel coordinates

Sometimes clusters overlap with categories given by some variable
- Non-mixing of groups is not the same as clustering!

Ordering problem

The problem of ordering (variables, observations) is one of the key problems in multidimensional visualization
- Sometimes has a huge impact on perception (heatmaps)
Many approaches exist

Problem formulation: Given data set \(\chi=\left( x_{ij}| i=1, \ldots, n, j=1, \ldots, p \right)\)

Select an order \(\Psi=\{i_1, \ldots, i_p\}\) that optimizes visual perception (analysis) -> this defines a reordering of the data columns \(\Psi: \chi \rightarrow \chi'\)

Note: \(p!\) possible orderings exist…

Ordering problem

Solution

Early approaches (e.g. Ankerst et al. 1998):

Choose a distance (proximity) matrix \(D=\left\{d_{ij}=d(x^i, x^j)\right\}\) between variables (columns)
- Euclidean distance on scaled columns
- 1 - correlation
This defines a graph with vertices \(1, \ldots, p\) and edge weights \(d_{ij}\) -> find the shortest Hamiltonian path (a variant of the Traveling Salesman Problem)

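\(\min_{\Psi} {\sum_{j=1}^{p-1} d'_{j,j+1}}\), where \(d'_{j,j+1}=d_{i_j, i_{j+1}}\) is the distance between the variables placed at positions \(j\) and \(j+1\)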

TSP is NP-hard -> approximate solutions are used
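A minimal sketch of this early approach: build the "1 - correlation" distance matrix between columns, then approximate the shortest Hamiltonian path with a nearest-neighbour heuristic tried from every start vertex. The heuristic is one simple choice of approximate solver picked for illustration, not necessarily the one used by Ankerst et al.; all function names are hypothetical.

```python
import math


def correlation(x, y):
    """Pearson correlation of two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def distance_matrix(columns):
    """d_ij = 1 - correlation between variables (columns) i and j."""
    p = len(columns)
    return [[1.0 - correlation(columns[i], columns[j]) for j in range(p)]
            for i in range(p)]


def path_length(d, order):
    """Sum of distances between neighbouring variables in the given order."""
    return sum(d[a][b] for a, b in zip(order, order[1:]))


def nearest_neighbour_order(d):
    """Greedy heuristic: from each start, always move to the closest unused variable."""
    p = len(d)
    best = None
    for start in range(p):
        order = [start]
        remaining = set(range(p)) - {start}
        while remaining:
            last = order[-1]
            nxt = min(remaining, key=lambda k: (d[last][k], k))
            order.append(nxt)
            remaining.remove(nxt)
        if best is None or path_length(d, order) < path_length(d, best):
            best = order
    return best


# Toy columns: x1 is x0 reversed (correlation -1), x2 is close to x0.
columns = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 2, 3, 5]]
d = distance_matrix(columns)
order = nearest_neighbour_order(d)
```

The heuristic gives no optimality guarantee in general; on three variables it happens to find the shortest path, which places variable 2 between variables 0 and 1.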

Ordering problem

Solution: modern approaches

Based on:
- Decreasing visual clutter
- Clustering data points/dimensions
- Outlier detection
- Dimensionality reduction (for ex. MDS)
- …
Note: most of these can be applied both for ordering observations and ordering variables
- Just transpose the data matrix…

Ordering problem

Objective functions:

Gradient measures (anti-Robinson)
Hamiltonian path length
Least squares
…

They are based on \(\min_{\Psi} {L(\Psi(D))}\)

Optimization algorithms:

Partial enumeration
Traveling salesman solvers
Hierarchical clustering
…

Gradient measures

Aim: distances should increase from diagonal

\[ d_{ik}\leq d_{ij} \mbox{ for } 1\leq i<k<j\leq n \]
\[ d_{kj}\leq d_{ij} \mbox{ for } 1\leq i<k<j\leq n \]

Objective function:

\[ L(D)= \sum_{i<k<j}\left[f(d_{ik}, d_{ij})+f(d_{kj}, d_{ij})\right] \]

where

\[ f(z,y)=\operatorname{sign}(z-y) \mbox{ or } f(z,y)=z-y \]
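The objective above can be evaluated directly for a given ordering. The sketch below follows the formula term by term. The function name `gradient_measure` and the `weighted` flag (which switches between the two choices of \(f\)) are hypothetical names introduced for illustration.

```python
def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def gradient_measure(d, order, weighted=False):
    """L(D) for the distance matrix d after reordering its rows/columns by `order`.

    weighted=False uses f(z, y) = sign(z - y); weighted=True uses f(z, y) = z - y.
    Lower values mean fewer (or smaller) violations of the anti-Robinson conditions.
    """
    f = (lambda z, y: z - y) if weighted else (lambda z, y: sign(z - y))
    n = len(order)
    total = 0
    for a in range(n):
        for b in range(a + 1, n):
            for c in range(b + 1, n):
                # Positions a < b < c in the new order map to original indices i, k, j.
                i, k, j = order[a], order[b], order[c]
                total += f(d[i][k], d[i][j]) + f(d[k][j], d[i][j])
    return total


# Toy example: three points on a line at positions 0, 1 and 3.
d = [
    [0, 1, 3],
    [1, 0, 2],
    [3, 2, 0],
]
good = gradient_measure(d, [0, 1, 2])        # natural order along the line
bad = gradient_measure(d, [0, 2, 1])         # middle point moved to the end
good_weighted = gradient_measure(d, [0, 1, 2], weighted=True)
```

The triple loop costs O(n^3) per evaluation, which is fine for scoring a candidate order but adds up when many orders are compared.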

Other objectives

Hamiltonian path length:

\[ L(D)=\sum_{i=1}^{n-1} d_{i,i+1} \]

Least squares criterion (PCA)

Solution is similar to first PCA component

\[ L(D)= \sum_i \sum_j (d_{ij}-|i-j|)^2 \]
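Both criteria on this slide are cheap to evaluate for a candidate order. A small sketch, assuming `d` is a symmetric distance matrix given as a list of lists and `order` is a permutation of its indices (the function names are hypothetical):

```python
def hamiltonian_path_length(d, order):
    """Sum of distances between objects that are adjacent in the ordering."""
    return sum(d[a][b] for a, b in zip(order, order[1:]))


def least_squares_criterion(d, order):
    """Sum over all position pairs (a, b) of (d_ab - |a - b|)^2 after reordering."""
    n = len(order)
    return sum(
        (d[order[a]][order[b]] - abs(a - b)) ** 2
        for a in range(n)
        for b in range(n)
    )


# Toy example: three points on a line at positions 0, 1 and 3.
d = [
    [0, 1, 3],
    [1, 0, 2],
    [3, 2, 0],
]
natural = [0, 1, 2]
shuffled = [0, 2, 1]
lengths = (hamiltonian_path_length(d, natural), hamiltonian_path_length(d, shuffled))
squares = (least_squares_criterion(d, natural), least_squares_criterion(d, shuffled))
```

Both criteria prefer the natural order here: the path length only looks at neighbours, while the least squares criterion also penalizes distant positions whose distances do not grow with \(|i-j|\).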

Optimization algorithms

Partial enumeration methods

Ex: Branch and bound, dynamic programming
Constructing solutions by parts

TSP solver

Suitable for the Hamiltonian path objective
Find shortest path by dynamic programming or heuristics
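As an illustration of the dynamic programming route, the sketch below finds an exact shortest Hamiltonian path with a Held-Karp style recursion over vertex subsets. This is a textbook technique chosen as an example, not necessarily what a particular TSP solver does. It needs O(2^n * n^2) time, so it is only practical for a small number of variables. The function name is hypothetical.

```python
def shortest_hamiltonian_path(d):
    """Exact shortest Hamiltonian path by dynamic programming over subsets.

    best[(mask, last)] = (length, predecessor) of the shortest path that visits
    exactly the vertices in the bitmask `mask` and ends in vertex `last`.
    """
    n = len(d)
    full = (1 << n) - 1
    best = {(1 << v, v): (0.0, None) for v in range(n)}
    # Masks are processed in increasing order, so every state is final when extended.
    for mask in range(1, full + 1):
        for last in range(n):
            if (mask, last) not in best:
                continue
            length = best[(mask, last)][0]
            for nxt in range(n):
                if mask & (1 << nxt):
                    continue  # vertex already on the path
                key = (mask | (1 << nxt), nxt)
                cand = length + d[last][nxt]
                if key not in best or cand < best[key][0]:
                    best[key] = (cand, last)
    # Pick the best end vertex, then walk the predecessors back to the start.
    end = min(range(n), key=lambda v: best[(full, v)][0])
    length = best[(full, end)][0]
    order, mask, v = [], full, end
    while v is not None:
        order.append(v)
        prev = best[(mask, v)][1]
        mask ^= 1 << v
        v = prev
    order.reverse()
    return length, order


# Toy example: four points on a line, listed in shuffled order.
positions = [5, 0, 9, 2]
d = [[abs(a - b) for b in positions] for a in positions]
length, order = shortest_hamiltonian_path(d)
```

For points on a line the optimum is to visit them in sorted order (in either direction), which gives a quick sanity check of the result.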

Optimization algorithms

Hierarchical clustering

Observations are joined into clusters
Clusters are joined into larger clusters
Until only one cluster is left
Leaves and branches are permuted to minimize the given objective
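A simplified sketch of ordering by hierarchical clustering. It assumes average linkage, and instead of optimizing a global objective over all branch permutations it flips the two branches at each merge so that the touching end points are as close as possible. This local rule is a simplification chosen for brevity, not the full leaf-permutation step described above. The function name is hypothetical.

```python
def hierarchical_order(d):
    """Agglomerative clustering that keeps each cluster as an ordered list of leaves."""
    clusters = [[i] for i in range(len(d))]

    def linkage(a, b):
        # Assumption: average linkage between two clusters.
        return sum(d[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        pairs = [
            (linkage(clusters[s], clusters[t]), s, t)
            for s in range(len(clusters))
            for t in range(s + 1, len(clusters))
        ]
        _, s, t = min(pairs)
        a, b = clusters[s], clusters[t]
        # Four ways to join the branches (flip either one); keep the join whose
        # two touching leaves are closest to each other.
        candidates = [a + b, a + b[::-1], a[::-1] + b, a[::-1] + b[::-1]]
        merged = min(candidates, key=lambda c: d[c[len(a) - 1]][c[len(a)]])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (s, t)]
        clusters.append(merged)
    return clusters[0]


# Toy example: four points on a line, listed in shuffled order.
positions = [5, 0, 10, 2]
d = [[abs(a - b) for b in positions] for a in positions]
order = hierarchical_order(d)
```

On this toy input the resulting order follows the points along the line; on real data the local flipping rule can miss better permutations of the tree.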

Effect of ordering

Heatmaps

A heatmap visualizes a matrix [n x m]

Normally rows = observations, columns = parameters
The heatmap has the corresponding size
Each cell of the matrix corresponds to a cell in the heatmap
High values correspond to intense colors in this map (or vice versa for other color schemes!)
Names of variables and observations are shown
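The cell-to-color mapping can be illustrated without a plotting library by rendering intensities as characters. In the sketch below the per-column min-max scaling and the five-step character ramp are assumptions made for the example; the function name `heatmap_rows` is hypothetical.

```python
def heatmap_rows(matrix, ramp=" .:*#"):
    """Render a matrix as text: one character per cell, denser character = higher value."""
    n_cols = len(matrix[0])
    # Assumption: each column (parameter) is scaled separately to [0, 1].
    lows = [min(row[j] for row in matrix) for j in range(n_cols)]
    highs = [max(row[j] for row in matrix) for j in range(n_cols)]
    lines = []
    for row in matrix:
        cells = []
        for j, v in enumerate(row):
            span = highs[j] - lows[j]
            t = (v - lows[j]) / span if span > 0 else 0.0
            # Map the scaled value to one of the ramp levels.
            level = min(int(t * len(ramp)), len(ramp) - 1)
            cells.append(ramp[level])
        lines.append("".join(cells))
    return lines


# Toy matrix: rows = observations, columns = parameters.
matrix = [
    [1, 10],
    [2, 30],
    [3, 20],
]
lines = heatmap_rows(matrix)
```

Scaling per column keeps a parameter with a large range from washing out the others; scaling over the whole matrix is the alternative when all parameters share a unit.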

Heatmaps

Analysis:

Compare the values of a parameter for different observations (down a column)
Compare the values for a single observation (along a row)
Compare the patterns for different rows or columns
Find similar observations (areas with the same color intensity)
Find which variables define similarity for a group
Find correlated variables (columns with similar patterns)

Heatmaps

Exercise (last picture):
- How many clusters do you see?
- Which variables define clusters?
- Which variables are correlated?

Effect of reordering

Gradient measure objective used
See new analysis possibilities

Radar charts

Use polar coordinate system
Map column value as a coordinate in certain direction

Radar charts

If juxtaposed, analyse:

Clusters
Outliers
Outlying directions

If superimposed,

Comparing variable length
Seeing similar and outlying observations

Radar charts

Problems:

Difficult to judge orientations
Number of dimensions and observations is very limited
- Number of observations is extremely limited if superimposed
Radar charts placed closer together are easier to compare
Perception is strongly affected by observation ordering

Ordering:

Same as before plus
Dimensions can be sorted to promote more symmetric charts

Radar charts

Now with reordering by Gradient Measures

Radar charts

Other positioning possible - PCA/MDS

Trellis plots / facets

Shingles

Creates overlap
To avoid boundary effects
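A sketch of how overlapping intervals (shingles) of a conditioning variable can be built. Equal-width intervals are used here for brevity; this is an assumption, and Trellis displays often use equal-count intervals instead. The function names and the `number` and `overlap` parameters are hypothetical.

```python
def shingles(values, number=3, overlap=0.5):
    """Split the range of `values` into `number` equal-width overlapping intervals.

    `overlap` is the fraction of an interval shared with its neighbour.
    """
    lo, hi = min(values), max(values)
    step_fraction = 1.0 - overlap
    # number intervals of equal width, each shifted by width * (1 - overlap),
    # must cover the range exactly: width * (1 + (number - 1) * (1 - overlap)) = hi - lo
    width = (hi - lo) / (1 + (number - 1) * step_fraction)
    intervals = []
    for m in range(number):
        start = lo + m * width * step_fraction
        intervals.append((start, start + width))
    return intervals


def assign(values, intervals):
    """List the values falling into each shingle (a value may appear in several)."""
    return [[v for v in values if a <= v <= b] for a, b in intervals]


ages = list(range(9))                 # toy conditioning variable: 0, 1, ..., 8
intervals = shingles(ages, number=3, overlap=0.5)
panels = assign(ages, intervals)
```

Each panel of the trellis plot would then show only the observations of one shingle; because neighbouring shingles share observations, a pattern near an interval boundary shows up in both panels.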

Example: AIDS data (Age, Time of Death, Time of Diag)

Shingles

AIDS data: conclusions?

Read at home

Chapter 5
Paper "Hahsler, M., Hornik, K., & Buchta, C. (2008). Getting things in order: an introduction to the R package seriation. Journal of Statistical Software, 25(3), 1-34".
(Browse through) paper "Ankerst, M., Berchtold, S., & Keim, D. A. (1998, October). Similarity clustering of dimensions for an enhanced visualization of multidimensional data. In Information Visualization, 1998. Proceedings. IEEE Symposium on (pp. 52-60). IEEE."
Paper "Becker, R. A., Cleveland, W. S., & Shyu, M. J. (1996). The visual design and control of trellis display. Journal of Computational and Graphical Statistics, 5(2), 123-155".