Data-Dependent Geometries and Structures: Analyses and Algorithms for Machine Learning

A standard paradigm of supervised learning is the data-independent hypothesis space. In this model, a data set is a sample of points from some space with a given geometry, so the distance between any two points in the space is independent of the particular sample. In a data-dependent geometry, by contrast, the distance depends on the particular points sampled. For example, consider a data set of “news stories” containing a story in the Financial Times about a renewed investment in nuclear technology and a story in the St. Petersburg Gazetteer about job losses from a decline in expected tourism. Although these initially appear dissimilar, the inclusion of a third story about an oil pipeline leak creates an indirect “connection” between them. In the data-independent case the “distance” between the two stories is unchanged, while in the data-dependent case the distances reflect the connection. This project was designed to address the challenges, both algorithmic and theoretical, posed by data-defined hypothesis spaces. It brought together three sites to address an underlying theme of the PASCAL2 proposal: leveraging prior knowledge about complex data. The complexity of real-world data is offset by its intricate geometric structure, whether hierarchical, long-tailed in distribution, graph-based, or otherwise. By allowing the data to define the hypothesis space, we may leverage these structures to enable practical learning, which is the core aim of this project. This three-way collaboration was thought likely to give rise to a wide spectrum of possible applications, fostering future opportunities for joint research activities.
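
The news-story example above can be made concrete with a toy sketch. It is not the project's method. It assumes each story is a hypothetical 2-D feature vector, and it takes one possible data-dependent distance: the shortest path through the sampled points, with squared Euclidean edge costs. The coordinates and the names `euclidean` and `path_distance` are illustrative only.

```python
import math
from itertools import product


def euclidean(p, q):
    """Data-independent distance: depends only on the two points."""
    return math.dist(p, q)


def path_distance(points, i, j):
    """Data-dependent distance (assumed for illustration): the shortest path
    from points[i] to points[j] over the complete graph on the sample,
    where each edge costs the squared Euclidean distance."""
    n = len(points)
    d = [[euclidean(points[a], points[b]) ** 2 for b in range(n)]
         for a in range(n)]
    # Floyd-Warshall: k (the intermediate point) varies slowest.
    for k, a, b in product(range(n), repeat=3):
        d[a][b] = min(d[a][b], d[a][k] + d[k][b])
    return d[i][j]


# Hypothetical embeddings of the three stories.
nuclear = (0.0, 0.0)
tourism = (2.0, 0.0)
pipeline = (1.0, 0.0)  # the "connecting" story lies between the other two

fixed = euclidean(nuclear, tourism)
before = path_distance([nuclear, tourism], 0, 1)
after = path_distance([nuclear, tourism, pipeline], 0, 1)
print(fixed, before, after)
```

In this sketch the Euclidean distance between the two original stories is the same whether or not the third story is in the sample. The path-based distance shrinks once the third story is included, because the route through it is cheaper than the direct edge.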

Knowledge 4 All Foundation Ltd.