Courses

Previous semesters

The websites of courses taught in previous semesters can be found here.

Statistik und Wahrscheinlichkeitsrechnung

Mathematik IV: Statistik

Fundamentals of Mathematical Statistics

Applied ANOVA and Experimental Design

Bachelor, master and semester thesis topics

Below you can find topics for bachelor, master or semester theses that the supervisors at the Seminar for Statistics offer.
Please note: This site is still under construction.

Mathieu Chevalley and Christoph Schultheiss

Contact: E-mail or E-mail

Evaluating causal models with interventional data

Description: Evaluating and selecting causal discovery models in practice is a challenging task, as the ground truth is by definition not known in an applied setting. This greatly reduces the applicability of such methods and thus calls for the development of evaluation metrics based only on empirical data. A recent paper makes an interesting contribution in that direction by proposing a statistical test to falsify causal DAGs using only empirical observational data. The goal of this project is to build on that idea and potentially extend it in the following ways: (1) improve its computational scalability in terms of graph size; (2) extend the metric to also leverage interventional data. We have a real-world biological dataset on which it can be tested and applied. The project can either take a more applied or theoretical direction depending on your interest.
Methods: Statistical test, permutation based test, causal discovery methods
Knowledge: Some knowledge of causality and coding (preferably in Python)
Data: Two large real-world interventional datasets
Literature:
Toward Falsifying Causal Graphs Using a Permutation-Based Test
CausalBench: A Large-scale Benchmark for Network Inference from Single-cell Perturbation Data
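
The permutation idea in the first paper can be illustrated in a few lines. The sketch below is illustrative only, not the cited method: it tests a single marginal independence with |correlation| as statistic, whereas the paper tests the independencies implied by a whole candidate DAG. The recipe is the same, though: compare the observed statistic to its distribution under random permutations.

```python
import numpy as np

def permutation_independence_test(x, y, n_perm=1000, seed=None):
    # p-value for H0 "X independent of Y" with |corr(X, Y)| as test statistic:
    # permuting y destroys any dependence, giving draws from the null distribution
    rng = np.random.default_rng(seed)
    stat = abs(np.corrcoef(x, y)[0, 1])
    perm = np.array([abs(np.corrcoef(x, rng.permutation(y))[0, 1])
                     for _ in range(n_perm)])
    # +1 in numerator and denominator yields a valid (slightly conservative) p-value
    return (1 + np.sum(perm >= stat)) / (1 + n_perm)

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y_indep = rng.normal(size=500)            # independent of x
y_dep = x + 0.5 * rng.normal(size=500)    # strongly dependent on x

p_indep = permutation_independence_test(x, y_indep, seed=1)
p_dep = permutation_independence_test(x, y_dep, seed=1)
```

Extending such a test to conditional independencies and to interventional regimes is where the project starts.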

Peter Bühlmann

Contact: E-mail

Cyrill Scheidegger

Contact: E-mail

Kernelised Conditional Independence Testing

Description: Conditional independence testing is an active area of research with many applications, for example in causality. Recently introduced methods include the generalised covariance measure (GCM) and its weighted generalisation (WGCM). In a new paper (https://arxiv.org/abs/2209.00124), the authors propose a kernelisation of the GCM/WGCM as an application of their more general theory. The goal of the project is to compare the kernelised version to the original tests more extensively than was done in the original paper. Potentially, the methods can also be applied in the context of goodness of fit for nonlinear SEMs.
Methods: Kernel methods, (nonparametric) regression
Knowledge: Hypothesis testing, some knowledge of causality
Data: Mostly simulated
Literature:
A general framework for the analysis of kernel-based tests
The hardness of conditional independence testing and the generalised covariance measure
The Weighted Generalised Covariance Measure
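
For orientation, the unweighted GCM of Shah and Peters is short to state: regress X on Z and Y on Z, and check whether the normalised mean of the residual products is compatible with N(0, 1). A minimal sketch with linear nuisance regressions (the kernelised versions replace these with RKHS regressions; the function name and toy data are illustrative):

```python
import numpy as np
from scipy import stats

def gcm_test(x, y, z):
    # residuals from regressing X on Z and Y on Z (linear here for simplicity;
    # any consistent nonparametric regression can be plugged in)
    Z = np.column_stack([np.ones_like(z), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    r = rx * ry
    # normalised sum of residual products is asymptotically N(0, 1) under H0
    t = np.sqrt(len(r)) * r.mean() / r.std()
    return t, 2 * stats.norm.sf(abs(t))

rng = np.random.default_rng(0)
z = rng.normal(size=2000)
x = z + rng.normal(size=2000)
y = z + rng.normal(size=2000)           # X independent of Y given Z
_, p_null = gcm_test(x, y, z)
y_alt = z + x + rng.normal(size=2000)   # conditional independence violated
_, p_alt = gcm_test(x, y_alt, z)
```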

Markus Kalisch

Contact: E-mail

Ordinal Response Models

Description: In many applied settings the response variable is an ordinal variable, i.e. a variable whose value exists on an arbitrary scale where only the relative ordering between different values is significant. In this project, you will read publications in the area, write a summary, apply and implement methods in R, and perform simulation studies.
Methods: Extensions to linear regression motivated by e.g. social sciences
Knowledge: Linear Regression
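
The standard starting point is the proportional-odds (ordered logit) model, fitted in R by e.g. MASS::polr. As a self-contained sketch of what such a fit does (function names are illustrative; cutpoints are kept ordered by parameterising them via log-increments):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def fit_proportional_odds(x, y, n_cat):
    # MLE for P(Y <= k | x) = expit(alpha_k - beta * x), y in {0, ..., n_cat-1}
    def unpack(theta):
        # first cutpoint free, remaining ones spaced by positive increments
        alpha = np.cumsum(np.concatenate([theta[:1], np.exp(theta[1:n_cat - 1])]))
        return alpha, theta[-1]

    def nll(theta):
        alpha, beta = unpack(theta)
        aug = np.concatenate([[-np.inf], alpha, [np.inf]])   # expit(+-inf) = 1 / 0
        cdf = expit(aug[None, :] - beta * x[:, None])
        probs = cdf[np.arange(len(x)), y + 1] - cdf[np.arange(len(x)), y]
        return -np.sum(np.log(np.clip(probs, 1e-12, None)))

    res = minimize(nll, x0=np.zeros(n_cat), method="BFGS")
    return unpack(res.x)

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
alpha_true = np.array([-1.0, 0.0, 1.0])
latent = 1.0 * x + rng.logistic(size=n)   # logistic noise gives the ordered logit
y = (latent[:, None] > alpha_true[None, :]).sum(axis=1)

alpha_hat, beta_hat = fit_proportional_odds(x, y, n_cat=4)
```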

Nonparametric Regression and Generalized Additive Models

Description: A generalized additive model (GAM) is a generalized linear model in which the response variable depends linearly on unknown smooth functions of some predictor variables. In this project, you will read publications in the area, write a summary, apply and implement methods in R, and perform simulation studies.
Methods: Extensions to linear regression motivated by many applied fields of research
Knowledge: Linear Regression
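
In practice one would fit a GAM with e.g. mgcv::gam in R. To see the underlying idea, here is a bare-bones sketch that fits an additive model by a single ridge-penalized regression on linear-spline bases (knot placement, penalty strength, and function names are illustrative choices, not a recommended implementation):

```python
import numpy as np

def spline_basis(x, knots):
    # truncated-power (linear) spline basis: [x, (x - k)_+ for each knot]
    return np.column_stack([x] + [np.maximum(x - k, 0.0) for k in knots])

def fit_additive_model(X, y, knots, lam=1.0):
    # y ~ intercept + f1(x1) + f2(x2), each f_j a penalized linear spline
    B = np.column_stack([np.ones(len(y))]
                        + [spline_basis(X[:, j], knots) for j in range(X.shape[1])])
    P = lam * np.eye(B.shape[1])
    P[0, 0] = 0.0                      # do not penalize the intercept
    coef = np.linalg.solve(B.T @ B + P, B.T @ y)
    return coef, B

rng = np.random.default_rng(0)
n = 1000
X = rng.uniform(-2, 2, size=(n, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.2 * rng.normal(size=n)   # additive truth

knots = np.linspace(-1.5, 1.5, 7)
coef, B = fit_additive_model(X, y, knots, lam=0.1)
fitted = B @ coef
```

Real GAM software additionally chooses the penalty by GCV/REML and reports uncertainty; that machinery is exactly what the reading part of the project covers.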

Lukas Meier

Contact: E-mail

Regression with Interval Censoring

Description: Read publications in the area, write a summary, apply and implement methods in R, and perform simulation studies.
Methods: Special regression models motivated by survival analysis
Knowledge: Linear regression
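
The key modelling idea: an observation known only to lie in an interval (L, R] contributes F(R) − F(L) to the likelihood instead of a density value. In R this is handled by e.g. survival::survreg; below is a self-contained Gaussian sketch with toy grid-censored data (all names and settings illustrative):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_interval_censored(x, left, right):
    # Gaussian regression when the response is only known to lie in (left, right];
    # each observation contributes Phi((right-mu)/sigma) - Phi((left-mu)/sigma),
    # with mu = b0 + b1 * x (np.inf / -np.inf would encode one-sided censoring)
    def nll(theta):
        b0, b1, log_sigma = theta
        mu, sigma = b0 + b1 * x, np.exp(log_sigma)
        p = norm.cdf((right - mu) / sigma) - norm.cdf((left - mu) / sigma)
        return -np.sum(np.log(np.clip(p, 1e-300, None)))

    res = minimize(nll, x0=np.zeros(3), method="Nelder-Mead",
                   options={"maxiter": 3000})
    b0, b1, log_sigma = res.x
    return b0, b1, np.exp(log_sigma)

rng = np.random.default_rng(0)
n = 1500
x = rng.normal(size=n)
t = 1.0 + 2.0 * x + 0.5 * rng.normal(size=n)     # latent exact response
grid = np.arange(-15.0, 16.0)                    # observation grid of width 1
left = grid[np.searchsorted(grid, t, side="right") - 1]
right = left + 1.0                               # only the interval is observed

b0, b1, sigma = fit_interval_censored(x, left, right)
```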

Dyadic Regression Models

Description: Dyadic regression is used to model pairwise interaction data (between people, countries, etc.); some models are also known as "gravity models". Read publications in the area, write a summary, apply and implement methods in R, and perform simulation studies.
Methods: Regression
Knowledge: Linear regression
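
A classical log-linear gravity model regresses the log flow between units i and j on characteristics of both endpoints and their distance. The toy numbers below are illustrative; note that naive OLS standard errors are wrong here because every unit appears in many dyads, which is precisely the dependence structure dyadic regression methods address:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30                                     # hypothetical "countries"
log_gdp = rng.normal(10, 1, size=n)
pos = rng.uniform(0, 10, size=(n, 2))      # locations, used only for distances

# all ordered pairs (i, j) with i != j
i, j = np.where(~np.eye(n, dtype=bool))
log_dist = np.log(np.linalg.norm(pos[i] - pos[j], axis=1))

# gravity data-generating process: flow grows with both GDPs, decays with distance
log_flow = (1.0 + 0.8 * log_gdp[i] + 0.6 * log_gdp[j]
            - 1.2 * log_dist + 0.3 * rng.normal(size=len(i)))

# log-linear gravity regression: log F_ij ~ log gdp_i + log gdp_j + log d_ij
Z = np.column_stack([np.ones(len(i)), log_gdp[i], log_gdp[j], log_dist])
coef, *_ = np.linalg.lstsq(Z, log_flow, rcond=None)
```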

Nicolai Meinshausen

Contact: E-mail

Fairness in Machine Learning

Description: Read a few key publications in the area of fairness in Machine Learning and write a concise summary, highlighting key conceptual commonalities and differences.
Methods: Linear regression and classification; tree ensembles; structural causal models
Knowledge: Regression and classification; causality
Data: Some standard benchmark datasets can be used; the project can also be more theoretical
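
Two of the most common observational fairness criteria take only a few lines to compute. The sketch below (toy simulated scores, illustrative function names) contrasts demographic parity, which compares positive rates across groups, with equal opportunity, which compares true-positive rates:

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    # |P(Yhat = 1 | A = 0) - P(Yhat = 1 | A = 1)|: difference in positive rates
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def equal_opportunity_gap(y_true, y_pred, group):
    # difference in true-positive rates between the two groups
    tpr = [y_pred[(group == g) & (y_true == 1)].mean() for g in (0, 1)]
    return abs(tpr[0] - tpr[1])

rng = np.random.default_rng(0)
n = 10000
group = rng.integers(0, 2, size=n)
score = rng.normal(size=n) + 0.5 * group        # score distribution shifted for group 1
y_true = (score + rng.normal(size=n) > 0).astype(int)
y_pred = (score > 0).astype(int)                # group-blind threshold classifier

dp = demographic_parity_gap(y_pred, group)
eo = equal_opportunity_gap(y_true, y_pred, group)
```

That a group-blind classifier can still show sizeable gaps on both criteria, and that the criteria generally cannot be satisfied simultaneously, is one of the conceptual tensions the summary should bring out.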

Invariant Risk Minimization

Description: Implement the invariant risk minimization framework of Arjovsky et al. (2019) and write a discussion.
Methods: Linear models; tree ensembles; deep networks; causal inference
Knowledge: Machine Learning; Causality
Data: Datasets in paper or some other simple simulation data; possibly some larger datasets

Out-of-distribution generalization

Description: Read some recent publications on out-of-distribution generalization and write a summary of their differences, advantages and drawbacks.
Methods: Linear models; tree ensembles; structural causal models
Knowledge: Regression and Classification; Causality
Data: Some small simulation studies; if of interest also larger datasets on ICU patient data

Quantile Treatment Effects

Description: Read up on quantile treatment effects, which characterize possibly heterogeneous causal effects, and write a summary of current approaches.
Methods: Linear models; tree ensembles; structural causal models; instrumental variables
Knowledge: Regression and Classification; Causality
Data: Can be theoretical; can also use some large-scale climate data
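
Under randomisation, unconditional quantile treatment effects are simply differences of marginal quantiles; note they identify quantiles of the two outcome distributions, not quantiles of the individual effects Y(1) − Y(0). A toy sketch (illustrative names and data):

```python
import numpy as np

def quantile_treatment_effects(y_treat, y_control, taus):
    # QTE(tau) = Q_{Y(1)}(tau) - Q_{Y(0)}(tau), estimable under randomisation
    return np.quantile(y_treat, taus) - np.quantile(y_control, taus)

rng = np.random.default_rng(0)
y0 = rng.normal(0, 1, size=20000)                 # control outcomes
y1 = rng.normal(0, 1, size=20000)
y1 = y1 + 1.0 + 0.5 * (y1 > 0)                    # heterogeneous effect: upper tail shifted more

taus = np.array([0.1, 0.5, 0.9])
qte = quantile_treatment_effects(y1, y0, taus)    # grows with tau here
```

The average treatment effect would hide this heterogeneity; the spread of QTE across τ is exactly what the approaches to be summarised try to estimate and do inference on.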

Xinwei Shen (Peter Bühlmann)

Contact: E-mail

Representation learning and distributional robustness

Description: To tackle prediction problems under distribution shift, existing causality-inspired robust prediction methods are mostly developed for linear models, e.g. anchor regression. The goal of this thesis is to extend these methods to nonlinear models, in particular by utilizing (nonlinear) representation learning.
Methods: Variational autoencoders, regression
Knowledge: Coding (preferably in Python), some experience with neural networks and causality
Data: Mostly simulated; some real data such as single-cell or ICU data
Literature:
Anchor regression: heterogeneous data meet causality
Causality-oriented robustness: exploiting general additive interventions
Auto-Encoding Variational Bayes
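
As a reference point for the linear case that the thesis would generalise: anchor regression has a closed form via a simple data transformation. With projection P_A onto the anchor variables, one runs OLS on data premultiplied by W = I + (√γ − 1) P_A; γ = 1 recovers OLS and γ → ∞ approaches the IV solution. The data-generating numbers below are illustrative:

```python
import numpy as np

def anchor_regression(X, y, A, gamma):
    # OLS after transforming the data with W = I + (sqrt(gamma) - 1) * P_A
    PA = A @ np.linalg.pinv(A)               # hat matrix of the anchors
    W = np.eye(len(y)) + (np.sqrt(gamma) - 1.0) * PA
    coef, *_ = np.linalg.lstsq(W @ X, W @ y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n = 1000
a = rng.normal(size=(n, 1))                  # anchor / exogenous variable
h = rng.normal(size=n)                       # hidden confounder
x = a[:, 0] + h + rng.normal(size=n)
y = 0.5 * x + h + rng.normal(size=n)         # true causal effect of x is 0.5
X = x[:, None]

b_ols = anchor_regression(X, y, a, gamma=1.0)      # gamma = 1: plain OLS (biased up)
b_anchor = anchor_regression(X, y, a, gamma=1e6)   # large gamma: close to IV solution
```

Replacing the linear map X @ beta by a learned representation (e.g. the encoder of a VAE) while keeping an anchor-type penalty is one possible direction for the extension.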

Fabio Sigrist

Contact: E-mail

Choosing tuning parameters for boosting

Description: Tree-boosting is a popular machine learning method whose application requires the choice of multiple tuning parameters. The goal of this thesis is to get an overview of different methods for choosing tuning parameters and to compare the performance (mainly accuracy vs. computational time) of different methods such as (random) grid search and different Bayesian optimization methods (e.g., tree-structured Parzen estimator (TPE) and Gaussian process regression).
Methods: Tree-boosting
Knowledge: Applied machine learning and coding (preferably Python)
Data: Multiple publicly available real-world data sets that need to be collected for the thesis
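
A useful baseline for such a comparison is plain random search. The sketch below uses a synthetic stand-in for the expensive cross-validated loss (a real study would call e.g. lightgbm.cv on each trial); the search-space bounds and parameter names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical search space for a tree-boosting model
space = {
    "learning_rate": lambda: 10 ** rng.uniform(-3, 0),       # log-uniform
    "num_leaves": lambda: int(rng.integers(2, 256)),
    "min_child_samples": lambda: int(rng.integers(5, 100)),
}

def cv_score(params):
    # stand-in for a cross-validated loss with a minimum near
    # (learning_rate=0.1, num_leaves=64, min_child_samples=20)
    return ((np.log10(params["learning_rate"]) + 1) ** 2
            + (params["num_leaves"] - 64) ** 2 / 1e4
            + (params["min_child_samples"] - 20) ** 2 / 1e4)

def random_search(n_trials):
    trials = [{k: draw() for k, draw in space.items()} for _ in range(n_trials)]
    scores = [cv_score(p) for p in trials]
    best = int(np.argmin(scores))
    return trials[best], scores[best]

best_params, best_score = random_search(200)
```

Bayesian optimization methods such as TPE or GP-based surrogates spend the same trial budget adaptively; quantifying how much that helps per unit of compute is the point of the thesis.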

Optimization methods for Gaussian process hyperparameter estimation

Description: Gaussian processes (GP) are a flexible class of probabilistic non-parametric function models which are used in both machine learning and statistics. The properties of Gaussian processes depend on a handful of so-called hyperparameters. These hyperparameters are usually selected by maximizing the log-likelihood function, which can be time-consuming. The goal of this thesis is to compare various optimization methods (e.g., gradient descent, Fisher scoring, Nelder-Mead, etc.) in terms of computational time.
Methods: Gaussian processes, convex optimization
Knowledge: Coding (R or Python) and Gaussian processes (ideally but not required)
Data: Simulated and real-world data sets that need to be collected for the thesis
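
The object all candidate optimizers work on is the (negative) log marginal likelihood. Below is a self-contained sketch with an RBF kernel, optimized here with Nelder-Mead as one of the methods to be compared (toy data; a small jitter term guards the Cholesky factorization):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(theta, X, y):
    # theta = (log lengthscale, log signal sd, log noise sd) of a zero-mean GP
    ell, sf, sn = np.exp(theta)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = sf ** 2 * np.exp(-0.5 * d2 / ell ** 2) + (sn ** 2 + 1e-6) * np.eye(len(y))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha + np.sum(np.log(np.diag(L)))
            + 0.5 * len(y) * np.log(2 * np.pi))

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-3, 3, size=(n, 1))
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_true = np.exp(-0.5 * d2 / 0.5 ** 2)            # lengthscale 0.5, signal sd 1
y = rng.multivariate_normal(np.zeros(n), K_true + 1e-8 * np.eye(n))
y = y + 0.1 * rng.normal(size=n)                 # noise sd 0.1

res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(X, y),
               method="Nelder-Mead", options={"maxiter": 2000})
ell_hat, sf_hat, sn_hat = np.exp(res.x)
```

Gradient descent and Fisher scoring require (exact or approximate) derivatives of this function; the comparison is then mainly about wall-clock time to reach the same optimum.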

The shape of log-likelihoods of latent Gaussian process models

Description: Gaussian processes (GP) are a flexible class of probabilistic non-parametric function models which are used in both machine learning and statistics. The properties of Gaussian processes depend on a handful of so-called hyperparameters. These hyperparameters are usually selected by maximizing the marginal log-likelihood function. For non-Gaussian data, these likelihood functions can have shapes for which it is difficult to find a maximum. The goal of this thesis is to explore the shape of log-likelihood functions under various settings. In particular, the goal is to develop recommendations concerning (i) under which settings likelihood functions are very flat and (ii) what to do in these cases.
Methods: Gaussian processes
Knowledge: Coding (R or Python) and Gaussian processes (ideally but not required)
Data: Simulated and real-world data sets that need to be collected for the thesis

The quality of large-data approximations for Gaussian processes

Description: Gaussian processes (GP) are a flexible class of probabilistic non-parametric function models which are used in both machine learning and statistics. For large data sets, however, computations with Gaussian processes quickly become unfeasible due to time and memory constraints. For this reason, various large-data approximations have been proposed. The goal of this thesis is to get an overview of the most common approximations and to systematically compare them (accuracy vs. computational time) on simulated and real-world data.
Methods: Gaussian processes
Knowledge: Coding (preferably Python) and Gaussian processes (ideally but not required)
Data: Simulated and real-world data sets that need to be collected for the thesis