Comparing ivis with other dimensionality reduction algorithms
Ivis aims to reduce data dimensionality whilst preserving both global and local structures. There are a number of real-world applications where this feature could be useful. For example:
Anomaly detection
Biological interpretation of high-throughput experiments
Feature extraction
Several algorithms have been proposed to address the problem of dimensionality reduction, including UMAP and t-SNE. UMAP, in particular, has been successfully applied in machine learning pipelines. Ivis differs from these approaches in several ways.
First, ivis does not make any assumptions as to the inherent structure of the dataset. Second, ivis is designed to handle both small and extremely large datasets: it performs well on toy datasets such as the iris dataset, and scales linearly to datasets with millions of observations. Indeed, we see the main use case for ivis as datasets with > 250,000 observations. Finally, ivis prioritises interpretation over visual appearance; this is accomplished by imposing meaning on distances between points in the embedding space. As such, ivis does not create spurious clusters, nor does it artificially pack clusters closer together. Embeddings aim to be true to the original structure of the data, which can be noisy in a real-world setting.
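For reference, a minimal sketch of producing an ivis embedding, assuming the ivis package is installed (the parameter values below are illustrative, not prescriptive):

```python
import numpy as np
from ivis import Ivis

# Placeholder data: 1,000 observations with 32 features.
X = np.random.rand(1000, 32)

# embedding_dims sets the output dimensionality; k is the number of
# nearest neighbours used to construct training triplets.
model = Ivis(embedding_dims=2, k=15)
embeddings = model.fit_transform(X)
print(embeddings.shape)  # (1000, 2)
```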
Visual Assessment
We will visually examine how popular dimensionality reduction algorithms - UMAP, t-SNE, Isomap, MDS, and PCA - approach two synthetic datasets with 5,000 observations in each. Since we are concerned with a dimensionality reduction problem, we will artificially add redundant features to the original datasets using polynomial combinations (degree ≤ 10) of the original features.
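As a sketch, such a redundant expansion can be produced with scikit-learn's PolynomialFeatures (the helper name below is ours, introduced for illustration):

```python
from sklearn.preprocessing import PolynomialFeatures

def add_redundant_features(X, degree=10):
    # Expand the original features into all polynomial combinations
    # of degree <= `degree`, adding redundancy without new information.
    return PolynomialFeatures(degree=degree, include_bias=False).fit_transform(X)
```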
Random Noise
To start, let’s examine how various dimensionality reduction methods behave in the presence of random noise. We generated 5,000 uniformly distributed random points in a two-dimensional space and expanded the feature set using polynomial combinations. In all cases, default parameters were used to fit the models.
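A sketch of this setup, assuming the umap-learn and ivis packages are installed and reusing the add_redundant_features helper from above (the random seed is illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, MDS, TSNE
from umap import UMAP
from ivis import Ivis

rng = np.random.RandomState(1234)      # seed is illustrative
noise = rng.uniform(size=(5000, 2))    # 5,000 uniform random 2-D points
X = add_redundant_features(noise)      # helper sketched above

# Fit each method with default parameters and collect the 2-D embeddings.
embeddings = {
    "PCA": PCA(n_components=2).fit_transform(X),
    "Isomap": Isomap().fit_transform(X),
    "MDS": MDS().fit_transform(X),
    "t-SNE": TSNE().fit_transform(X),
    "UMAP": UMAP().fit_transform(X),
    "ivis": Ivis(embedding_dims=2).fit_transform(X),
}
```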
Both ivis and PCA reliably recovered the random nature of our dataset. Conversely, Isomap, UMAP, and t-SNE appeared to pack certain points together, creating an impression of clusters within uniform random noise.
Structured Datasets
Next, we examine how well global features of a dataset, such as the relative positions of clusters, can be recovered in a low-dimensional space.
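The first structured dataset consists of two interleaving half-moons. It can be generated along these lines with scikit-learn, where the noise level is an assumption and add_redundant_features is the helper sketched earlier:

```python
from sklearn.datasets import make_moons

# Two interleaving half-moons; the noise level here is illustrative.
moons, _ = make_moons(n_samples=5000, noise=0.05, random_state=1234)
X_moons = add_redundant_features(moons)  # redundant polynomial expansion
```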
Using default parameters, we can see that ivis captures both the general structure of each half-moon and their relative positions to one another. Both UMAP and t-SNE appear to introduce spurious clusters, and the global relationships between the half-moons appear to be disrupted.
As above, UMAP and t-SNE appear to generate a large number of small clusters along the continuous distribution of the dataset, although the global structure is relatively well-preserved. In contrast, ivis maintains both the global and local structures of the dataset.
Quantitative Evaluation
To measure how well each algorithm preserves global distances, we examined the correlation between pairwise distances in the original dataset and in the embedding space. For this analysis, 10,000 observations were chosen from the Levine dataset (104,184 x 32) using uniform random sampling. Box plots represent distances across pairs of points in the embeddings, binned using 50 equal-width bins over the pairwise distances in the original space. Pearson correlation coefficients were also computed over the pairs of distances.
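A sketch of this evaluation (the function name and return values are ours; note that with 10,000 observations, pdist produces roughly 50 million pairs, so this is memory-hungry but straightforward):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

def distance_preservation(X_original, X_embedded, metric="euclidean", n_bins=50):
    # Pairwise distances before and after embedding; use
    # metric="cityblock" for L1 and "euclidean" for L2.
    d_orig = pdist(X_original, metric=metric)
    d_emb = pdist(X_embedded, metric=metric)

    # Pearson correlation over the paired distances.
    r, _ = pearsonr(d_orig, d_emb)

    # Assign each pair to one of `n_bins` equal-width bins over the
    # original-space distances (used to draw the box plots).
    edges = np.linspace(d_orig.min(), d_orig.max(), n_bins + 1)
    bin_idx = np.digitize(d_orig, edges[1:-1])
    return r, bin_idx
```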
ivis appeared to preserve small-, mid-, and large-scale L1 and L2 distances, whilst UMAP and t-SNE seemed to ignore mid- to large-scale distances. Interestingly, ivis was particularly good at preserving L2 distances in low-dimensional space.