Imaging and Computing Seminar

Mauro Maggioni, Mathematics, Duke University

Intrinsic dimensionality estimation and multiscale geometry of data sets

The analysis of large data sets, modeled as point clouds in high-dimensional spaces, is needed in a wide variety of applications, such as recommendation systems, search engines, molecular dynamics, machine learning, and statistical modeling, to name a few. It is often claimed or assumed that many data sets, while lying in high-dimensional spaces, in fact have low-dimensional structure. It may come as a surprise that only very few, and rather sample-inefficient, algorithms exist to estimate the intrinsic dimensionality of these point clouds. We present a recent multiscale algorithm for estimating the intrinsic dimensionality of data sets, under the assumption that they are sampled from a rather tame low-dimensional object, such as a manifold, and perturbed by high-dimensional noise. Under natural assumptions, this algorithm can be proven to estimate the correct dimensionality using a number of points that is merely linear in the intrinsic dimension. Experiments on synthetic and real data will be discussed. Furthermore, this algorithm opens the way to novel algorithms for exploring, visualizing, compressing, and manipulating certain classes of high-dimensional point clouds.
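
To make the setting concrete, here is a minimal sketch of one generic way to look at a point cloud across scales. It is an illustration under stated assumptions, not the algorithm presented in the talk, whose details the abstract does not give. The idea assumed here is local PCA at several radii: singular values along the underlying low-dimensional object should grow roughly linearly with the radius, while those caused by noise should stay roughly flat. The function names, the largest-gap rule, and the synthetic data (a 3-dimensional cube embedded in 20 dimensions with small Gaussian noise) are all hypothetical choices made for this example.

```python
import numpy as np


def multiscale_singular_values(points, center, radii):
    """Singular values of the centered points inside each ball around `center`.

    Returns an array of shape (len(radii), ambient_dim). Each row holds the
    singular values at one radius, scaled by 1/sqrt(number of points in the ball).
    """
    ambient_dim = points.shape[1]
    dists = np.linalg.norm(points - center, axis=1)
    rows = []
    for r in radii:
        ball = points[dists <= r]
        row = np.zeros(ambient_dim)
        if len(ball) >= 2:
            centered = ball - ball.mean(axis=0)
            s = np.linalg.svd(centered, compute_uv=False) / np.sqrt(len(ball))
            row[: len(s)] = s  # pad with zeros if the ball has few points
        rows.append(row)
    return np.array(rows)


def estimate_dimension(singular_values, radii):
    """Heuristic (an assumption of this sketch): fit a line to each singular
    value as a function of radius, sort the slopes, and split at the largest gap."""
    slopes = np.polyfit(radii, singular_values, 1)[0]
    slopes = np.sort(slopes)[::-1]
    gaps = slopes[:-1] - slopes[1:]
    return int(np.argmax(gaps)) + 1


# Synthetic example: 3-dimensional cube in 20 ambient dimensions, plus noise.
rng = np.random.default_rng(0)
n_points, ambient_dim, true_dim, noise_level = 2000, 20, 3, 0.01
data = np.zeros((n_points, ambient_dim))
data[:, :true_dim] = rng.uniform(-1.0, 1.0, size=(n_points, true_dim))
data += noise_level * rng.standard_normal((n_points, ambient_dim))

radii = np.linspace(0.4, 1.0, 7)
sv = multiscale_singular_values(data, np.zeros(ambient_dim), radii)
estimated = estimate_dimension(sv, radii)
print("estimated intrinsic dimension:", estimated)
```

The sketch uses a single center and a hand-picked range of radii. In practice the useful range of scales depends on the noise level and the curvature of the underlying object, which is the kind of question a multiscale analysis is meant to address.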