[Statlist] ETH Young Data Science Researcher Seminar Zurich, Virtual Seminar by Merle Behr, University of California, Berkeley

Thu Oct 29 07:41:47 CET 2020

Dear all,

We are glad to announce the following talk in the virtual ETH Young Data Science Researcher Seminar Zurich:

"Learning Compositional Structures"  
by Merle Behr, University of California, Berkeley

Time: Friday, 30 October 2020, 15:00-16:00
Place: Zoom at https://ethz.zoom.us/j/92367940258

Abstract: Many data problems, in particular in biogenetics, come with a highly complex underlying structure, which makes it difficult to extract interpretable information. In this talk we want to demonstrate that these complex structures are often well approximated by a composition of a few simple parts, which provides very descriptive insights into the underlying data-generating process. We demonstrate this with two examples.
In the first example, the single components are finite alphabet vectors (e.g., binary components), which encode some discrete information. For instance, in genetics a binary vector of length n can encode whether or not a mutation (e.g., a SNP) is present at location i = 1,…,n in the genome. At the population level, studying genetic variation is often highly complex, as various groups of mutations are present simultaneously. However, in many settings a population might be well approximated by a composition of a few dominant groups. Examples are Evolve and Resequence experiments, where the external supply of genetic variation is limited and thus, over time, only a few haplotypes survive. Similarly, in a tumor, often only a few competing groups of cancer cells (clones) come out on top.
In the second example, the single components relate to separate branches of a tree structure. Tree structures, showing hierarchical relationships between samples, are ubiquitous in genomic and biomedical sciences. A common question in many studies is whether there is an association between a response variable and the latent group structure represented by the tree. In general, such a relation can be highly complex. However, it is often well approximated by a simple composition of relations associated with a few branches of the tree.
For both of these examples we first study theoretical aspects of the underlying compositional structure, such as identifiability of single components and optimal statistical procedures under probabilistic data models. Based on this, we gain insights into practical aspects of the problem, namely how to actually recover such components from data.

Best wishes,

M. Azadkia, Y. Chen, M. Löffler, A. Taeb

Seminar website: https://math.ethz.ch/sfs/news-and-events/young-data-science.html