.. _comparisons:

Comparing ivis with other dimensionality reduction algorithms
=============================================================

Ivis aims to reduce data dimensionality whilst preserving both global and local structures. There are a number of real-world applications where this feature is useful, for example:

- Anomaly detection
- Biological interpretation of high-throughput experiments
- Feature extraction

Several algorithms have been proposed to address the problem of dimensionality reduction, including UMAP and t-SNE. UMAP, in particular, has been successfully applied in machine learning pipelines.

Ivis differs from these approaches in several ways. First, ``ivis`` makes no assumptions about the inherent structure of the dataset. Second, ``ivis`` is designed to handle both small and extremely large datasets. Ivis performs well on toy datasets such as the *iris* dataset, and scales linearly to datasets with millions of observations. Indeed, we see that the main use case for ``ivis`` is datasets with > 250,000 observations. Finally, ``ivis`` prioritises interpretation over visual appearance - this is accomplished by imposing meaning on distances between points in the embedding space. As such, ``ivis`` does not create spurious clusters, nor does it artificially pack clusters closer together. Embeddings aim to be true to the original structure of the data, which can be noisy in a real-world setting.

Visual Assessment
-----------------

We will visually examine how popular dimensionality reduction algorithms - UMAP, t-SNE, Isomap, MDS, and PCA - approach two synthetic datasets with 5,000 observations each. Since we are concerned with a dimensionality reduction problem, we artificially add redundant features to the original datasets using polynomial combinations (degree ≤ 10) of the original features.
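The feature-expansion step described above can be sketched with scikit-learn's ``PolynomialFeatures``; the uniform toy data and variable names below are illustrative stand-ins, not the exact benchmark setup:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Illustrative stand-in data: 5,000 points in two dimensions.
rng = np.random.default_rng(42)
X = rng.uniform(size=(5000, 2))

# Add redundant features as polynomial combinations (degree <= 10)
# of the original two features, as described above.
poly = PolynomialFeatures(degree=10, include_bias=False)
X_expanded = poly.fit_transform(X)

print(X.shape, X_expanded.shape)  # the expanded matrix has many more columns
```

With two original features and degree 10, the expansion yields 65 polynomial feature columns, so the dimensionality reduction methods must recover a two-dimensional structure from a much wider input matrix.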
Random Noise
~~~~~~~~~~~~

To start, let's examine how various dimensionality reduction methods behave in the presence of random noise. We generated 5,000 uniformly distributed random points in a two-dimensional space and expanded the feature set using polynomial combinations. In all cases, default parameters were used to fit the models.

.. image:: _static/random_projections_benchmaks.png

Both ``ivis`` and PCA reliably recovered the random nature of our dataset. Conversely, Isomap, UMAP, and t-SNE appeared to pack certain points together, creating an impression of clusters within uniform random noise.

Structured Datasets
~~~~~~~~~~~~~~~~~~~

Next, we examine how well global features of a dataset, such as the relative positions of clusters, can be recovered in a low-dimensional space.

.. image:: _static/comparisons_moons.png

Using default parameters, we can see that ``ivis`` captures both the general structure of each half-moon and their positions relative to one another. UMAP and t-SNE both appear to introduce spurious clusters, and the global relationship between the half-moons appears to be disrupted.

.. image:: _static/comparisons_swiss_roll.png

As above, UMAP and t-SNE appear to generate a large number of small clusters along the continuous distribution of the dataset, although the global structure is relatively well preserved. ``ivis`` maintains both the global and local structures of the dataset.

Quantitative Evaluation
-----------------------

To measure how well each algorithm preserves global distances, we examined the correlation between pairwise distances in the original dataset and in the embedding space. For this analysis, 10,000 observations were chosen from the Levine dataset (104,184 x 32) using uniform random sampling. Box plots represent distances across pairs of points in the embeddings, binned using 50 equal-width bins over the pairwise distances in the original space. Pearson correlation coefficients were also computed over the pairs of distances.

.. image:: _static/comparisons_ivis_umap_levine_distances.png

``ivis`` appeared to preserve small-, mid-, and large-scale L1 and L2 distances, whilst UMAP and t-SNE seemed to ignore mid- to large-scale distances. Interestingly, ``ivis`` was particularly good at preserving L2 distances in low-dimensional space.
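The distance-preservation analysis above can be sketched roughly as follows. This is a minimal illustration, assuming synthetic Gaussian data in place of the Levine dataset and PCA as a stand-in embedding; the sample size and binning are scaled down for brevity:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr
from sklearn.decomposition import PCA

# Stand-in high-dimensional data (the real analysis used 10,000
# observations sampled from the 104,184 x 32 Levine dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32))

# Any 2-D embedding could be evaluated this way; PCA is used here
# purely as an example embedding.
embedding = PCA(n_components=2).fit_transform(X)

# Pairwise L2 distances in the original and embedding spaces.
d_orig = pdist(X)
d_emb = pdist(embedding)

# Pearson correlation between the two sets of pairwise distances.
r, _ = pearsonr(d_orig, d_emb)

# Bin embedding distances by 50 equal-width bins over the original
# distances, mirroring the box-plot construction described above.
bins = np.linspace(d_orig.min(), d_orig.max(), 51)
bin_idx = np.digitize(d_orig, bins)

print(f"Pearson r between distance sets: {r:.3f}")
```

A high correlation, and box plots whose medians rise monotonically across the bins, indicate that the embedding preserves the original distance structure at all scales rather than only locally.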