Previous semesters

The websites of courses taught in previous semesters can be found here.

Statistik und Wahrscheinlichkeitsrechnung

Mathematik IV: Statistik

Fundamentals of Mathematical Statistics

Applied ANOVA and Experimental Design

Bachelor, master and semester thesis topics

Below you can find topics for bachelor, master or semester theses that the supervisors at the Seminar for Statistics offer.
Please note: This site is still under construction.

Juan Gamella (Peter Bühlmann)

Contact: E-mail

Benchmarking causal discovery algorithms on real physical systems

Description: A fundamental difficulty in the field of causal inference is the absence of good validation datasets collected from real systems or phenomena. This is partly due to there being few incentives to collect and publish data from real systems that are already well understood, although such systems would be the ideal testbed for a large spectrum of causal and empirical inference algorithms. To address this problem, we have constructed two physical devices that allow measuring and manipulating different variables of simple but well-understood physical phenomena. The devices enable the inexpensive collection of large amounts of multivariate observational and interventional data, which, together with a justified causal ground truth, make them suitable to validate a wide range of causal inference algorithms.
In this project, you will help answer whether existing causal discovery algorithms can learn these simple systems. This will entail a literature review of existing methods, writing code (in Python and/or R) to benchmark the algorithms, and analyzing the results with a view to a publication. You will get an overview of the field of causal discovery, and I will teach you some basic software engineering (git, best practices, ...) if you don't already have these skills.
Methods: Causal discovery
Knowledge: Some implementation skills required (Python and R).

Peter Bühlmann

Contact: E-mail

Using anchor regression for out of distribution generalization of contemporary widely used risk prediction models

Description: Collaboration with Olga Demler, Harvard: Our research aims to illustrate the effectiveness of anchor regression in achieving out-of-distribution generalizability for widely used risk prediction models. We will analyze data from the UK Biobank (UKB) and VITAL (a large randomized controlled trial), though we may consider replacing the VITAL dataset with a different one if it better serves the goals of our project.
Methods: anchor regression, regression, r-value
Knowledge: coding (preferably in R), experience in survival analysis, and interest in generalizability and causality
Data: UK Biobank and VITAL – a large randomized controlled trial
Rothenhäusler, D., Meinshausen, N., Bühlmann, P. and Peters, J., 2021. Anchor regression: Heterogeneous data meet causality. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(2), pp.215-246.
Kook, L., Sick, B. and Bühlmann, P., 2022. Distributional anchor regression. Statistics and Computing, 32(3), p.39.
Shen, Z., Liu, J., He, Y., Zhang, X., Xu, R., Yu, H. and Cui, P., 2021. Towards out-of-distribution generalization: A survey. arXiv preprint arXiv:2108.13624.
Jaljuli, I., Benjamini, Y., Shenhav, L., Panagiotou, O.A. and Heller, R., 2023. Quantifying replicability and consistency in systematic reviews. Statistics in Biopharmaceutical Research, 15(2), pp.372-385.
Lloyd-Jones, D. M. et al. Use of Risk Assessment Tools to Guide Decision-Making in the Primary Prevention of Atherosclerotic Cardiovascular Disease: A Special Report From the American Heart Association and American College of Cardiology. Circulation 139, e1162–e1177 (2019).
Visseren, F. L. J. et al. 2021 ESC Guidelines on cardiovascular disease prevention in clinical practice: Developed by the Task Force for cardiovascular disease prevention in clinical practice with representatives of the European Society of Cardiology and 12 medical societies With the special contribution of the European Association of Preventive Cardiology (EAPC). Eur. Heart J. 42, 3227–3337 (2021).
SCORE2 working group and ESC Cardiovascular risk collaboration. SCORE2 risk prediction algorithms: new models to estimate 10-year risk of cardiovascular disease in Europe. Eur. Heart J. 42, 2439–2454 (2021).

Markus Kalisch

Contact: E-mail

Robust Regression

Description: Standard assumptions in regression are often not met in practice. For example, a single outlier might completely distort the result of an OLS regression. Such outliers might be a nuisance (e.g. a typo) or the main point of interest (e.g. a fraudulent transaction). Robust methods try to improve this situation by being less sensitive to severe model violations while still producing reasonable estimates if the standard model assumptions are met. In this project, you will read publications in the area, write a summary, apply and implement methods in R, and perform simulation studies.
Methods: Extensions to linear regression motivated by many applied fields of research
Knowledge: Linear Regression
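
To make the outlier sensitivity described above concrete, here is a minimal sketch in Python with NumPy. It compares OLS with a Huber M-estimator fitted by iteratively reweighted least squares. The simulated data, the tuning constant c = 1.345, the MAD scale estimate and the function names are illustrative assumptions, not part of the project description (which foresees R).

```python
import numpy as np


def ols_fit(X, y):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def huber_fit(X, y, c=1.345, n_iter=50):
    """Huber M-estimate via iteratively reweighted least squares (IRLS)."""
    beta = ols_fit(X, y)
    for _ in range(n_iter):
        resid = y - X @ beta
        # Robust residual scale: normalized median absolute deviation (MAD).
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        u = np.abs(resid) / max(scale, 1e-12)
        # Huber weights: 1 for small residuals, c/|u| for large ones.
        w = np.where(u <= c, 1.0, c / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        # Weighted least squares step with the current weights.
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta


# Simulated data (assumption): true intercept 1, true slope 2.
rng = np.random.default_rng(0)
n = 50
x = np.linspace(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[-1] += 200.0  # a single gross outlier, e.g. a typo

X = np.column_stack([np.ones(n), x])
beta_ols = ols_fit(X, y)
beta_huber = huber_fit(X, y)

print("OLS   (intercept, slope):", beta_ols)
print("Huber (intercept, slope):", beta_huber)
```

In this setup the single contaminated observation pulls the OLS slope far from 2, while the Huber fit stays close to it. Huber weights bound the influence of large residuals but do not protect against every kind of leverage point, which is one reason the literature offers many alternative robust estimators.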

Distributional Regression

Description: Distributional regression models have seen increasing interest in the past decade. They overcome the traditional focus on relating the conditional mean of the response to explanatory variables and instead target either the complete conditional response distribution or more general features thereof. We will review several such methods, summarize and compare them, and think about pros and cons for practitioners.
Methods: Extensions to linear regression motivated by many applied fields of research
Knowledge: Linear Regression

Lukas Meier

Contact: E-mail

Bayesian Multilevel Models using Stan

Description: The R package brms implements a wide range of multilevel models (linear, generalized linear, ...) using a Bayesian approach based on Stan. The goal of this thesis is to get familiar with these approaches, compare them to frequentist implementations like lme4, and highlight benefits and limitations.
Methods: Generalized Linear Models, Bayesian approaches
Knowledge: Linear regression, Generalized Linear Models, basics of Bayesian approaches

Nicolai Meinshausen

Contact: E-mail

Fairness in Machine Learning

Description: Read a few key publications in the area of fairness in machine learning and write a concise summary, highlighting key conceptual commonalities and differences.
Methods: Linear regression and classification; tree ensembles; structural causal models
Knowledge: Regression and classification; causality
Data: some standard benchmark datasets can be used but can also be more theoretical

Invariant Risk Minimization

Description: Implement the invariant risk minimization framework of Arjovsky et al. (2019) and write a discussion.
Methods: Linear models; tree ensembles; deep networks; causal inference
Knowledge: Machine Learning; Causality
Data: Datasets in paper or some other simple simulation data; possibly some larger datasets

Out-of-distribution generalizations

Description: Read some recent publications on out-of-distribution generalization and write a summary of their differences, advantages and drawbacks.
Methods: Linear models; tree ensembles; structural causal models
Knowledge: Regression and Classification; Causality
Data: Some small simulation studies; if of interest also larger datasets on ICU patient data

Quantile Treatment Effects

Description: Read up on quantile treatment effects, which characterize possibly heterogeneous causal effects, and write a summary of current approaches.
Methods: Linear models; tree ensembles; structural causal models; instrumental variables
Knowledge: Regression and Classification; Causality
Data: Can be theoretical; can also use some large-scale climate data

Xinwei Shen (Peter Bühlmann)

Contact: E-mail

Representation learning and distributional robustness

Description: To tackle prediction problems under distribution shifts, existing causality-inspired robust prediction methods have mostly been developed for linear models, e.g. anchor regression. The goal of this thesis is to extend these methods to nonlinear models, in particular by using (nonlinear) representation learning.
Methods: variational autoencoder, regression.
Knowledge: coding (preferably in python), some experience in neural networks and causality.
Data: mostly simulated; some real data such as single-cell or ICU data.
Anchor regression: heterogeneous data meet causality
Causality-oriented robustness: exploiting general additive interventions
Auto-encoding variational Bayes
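
As background on the linear starting point named in the description, the following is a minimal sketch of linear anchor regression in Python with NumPy. It uses the fact that the anchor regression objective, the squared residual orthogonal to the anchors plus gamma times the squared residual projected onto the anchors, can be minimized by OLS on transformed data. The simulated data, the function names and the choice of gamma values are illustrative assumptions, and all variables are assumed to be centered.

```python
import numpy as np


def project_on_anchors(M, A):
    """Project the columns of M onto the column space of the anchors A."""
    coef = np.linalg.lstsq(A, M, rcond=None)[0]
    return A @ coef


def anchor_regression(X, y, A, gamma):
    """Anchor regression via OLS on data transformed by I - (1 - sqrt(gamma)) P_A."""
    shrink = 1.0 - np.sqrt(gamma)
    X_t = X - shrink * project_on_anchors(X, A)
    y_t = y - shrink * project_on_anchors(y, A)
    return np.linalg.lstsq(X_t, y_t, rcond=None)[0]


# Simulated example (assumption): anchor A shifts X, hidden H confounds X and y.
rng = np.random.default_rng(1)
n = 1000
A = rng.normal(size=(n, 1))
H = rng.normal(size=n)
X = (A[:, 0] + H + rng.normal(size=n))[:, None]
y = 1.0 * X[:, 0] + 2.0 * H + rng.normal(size=n)  # causal coefficient is 1

b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
b_gamma1 = anchor_regression(X, y, A, gamma=1.0)     # coincides with OLS
b_large = anchor_regression(X, y, A, gamma=1000.0)   # close to an IV estimate

print("OLS:          ", b_ols)
print("gamma = 1:    ", b_gamma1)
print("gamma = 1000: ", b_large)
```

With gamma = 1 the transformation is the identity and the estimate equals OLS. As gamma grows, the estimate moves toward the instrumental variables solution, which in this simulated setting lies near the causal coefficient.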

Fabio Sigrist

Contact: E-mail

Non-Gaussian random effects in machine learning models for non-Gaussian data

Description: Random effects models are widely used in statistics and machine learning for modeling hierarchically grouped (=clustered) data or data with high-cardinality categorical predictor variables. The goal of this thesis is to investigate and compare non-Gaussian random effects models in machine learning models for non-Gaussian data.
Methods: Neural networks, tree-boosting, linear regression models, grouped random effects models
Knowledge: Coding (R or Python, ideally also C++)
Data: Simulated and real-world data sets that need to be collected for the thesis

Neural estimators for likelihood-free inference

Description: For some models, the likelihood cannot be (efficiently) evaluated but sampling from the model is easy. Doing inference (usually Bayesian) relying solely on samples from the model, without evaluating the likelihood, is called "likelihood-free inference" or "simulation-based inference". Neural estimators are a relatively recent approach for doing likelihood-free inference. They work by mapping the data to a set of parameters of a distribution using neural networks. The goal of this thesis is to compare different neural estimators for various models and settings.
Methods: Neural networks, likelihood-free inference
Knowledge: Coding (R or Python)
Data: Real-world data sets that need to be collected for the thesis

The smoothness parameter in Matérn covariance functions for Gaussian processes

Description: Gaussian processes (GP) are a flexible class of probabilistic non-parametric function models which are used in both machine learning and statistics. Gaussian processes are defined by a mean and a covariance function. For the latter, Matérn covariance functions are a flexible and widely used class. The Matérn covariance function contains several (hyper-)parameters, one of which is the smoothness parameter. The goal of this thesis is to compare different approaches for estimating this smoothness parameter and to investigate the importance of the smoothness parameter for prediction accuracy.
Methods: Gaussian processes, convex optimization
Knowledge: Coding (R or Python, ideally also C++)
Data: Simulated and real-world data sets that need to be collected for the thesis
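
For orientation, here is a minimal sketch in Python with NumPy of the Matérn covariance function for the three half-integer smoothness values that have simple closed forms (0.5, 1.5 and 2.5). The function name, the parameterization with a length scale and a marginal variance, and the evaluation distance are illustrative assumptions. General smoothness values require a modified Bessel function and are not covered here.

```python
import numpy as np


def matern(r, nu, length_scale=1.0, variance=1.0):
    """Matérn covariance at distance r for nu in {0.5, 1.5, 2.5}."""
    r = np.asarray(r, dtype=float)
    s = np.sqrt(2.0 * nu) * r / length_scale
    if nu == 0.5:
        poly = 1.0                      # exponential covariance
    elif nu == 1.5:
        poly = 1.0 + s
    elif nu == 2.5:
        poly = 1.0 + s + s**2 / 3.0
    else:
        raise ValueError("only nu in {0.5, 1.5, 2.5} is implemented in this sketch")
    return variance * poly * np.exp(-s)


# How quickly does the covariance drop just away from distance zero?
h = 0.01
drops = {nu: float(1.0 - matern(h, nu)) for nu in (0.5, 1.5, 2.5)}
for nu, drop in drops.items():
    print(f"nu = {nu}: 1 - k({h}) = {drop:.2e}")
```

The drop near zero is roughly linear in the distance for smoothness 0.5 and roughly quadratic for 1.5 and 2.5. This reflects the rougher sample paths of the former and is one reason the smoothness parameter can matter for prediction.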

A comparison of sparse Cholesky factorization implementations

Description: The Cholesky decomposition is widely used for doing inference with (generalized) linear mixed effects models and (latent) Gaussian process models. For sparse matrices, there are various versions using different orderings and different libraries such as CHOLMOD and PARDISO. The goal of this thesis is to compare the performance (speed) of these different Cholesky factorization implementations for (i) generalized linear mixed effects models and (ii) latent Gaussian processes with sparse covariance or sparse precision matrices.
Methods: Linear mixed effects models, Gaussian processes
Knowledge: Solid knowledge in C++ and a high-level language such as R or Python
Data: Simulated and real-world data sets that need to be collected for the thesis
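
To illustrate why the ordering mentioned above matters, here is a minimal sketch in Python. It uses dense NumPy routines only, so it does not call CHOLMOD or PARDISO, and the arrowhead test matrix and function names are illustrative assumptions. It counts the nonzero entries of the Cholesky factor of the same matrix under two orderings.

```python
import numpy as np


def arrowhead(n):
    """Symmetric positive definite matrix: diagonal plus a dense first row/column."""
    A = n * np.eye(n)   # diagonal entry n makes the matrix diagonally dominant
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    return A


def factor_nnz(A, tol=1e-12):
    """Number of nonzero entries in the lower-triangular Cholesky factor of A."""
    L = np.linalg.cholesky(A)
    return int(np.count_nonzero(np.abs(L) > tol))


n = 8
A = arrowhead(n)

# Reverse the ordering so that the dense row/column comes last.
perm = np.arange(n)[::-1]
A_reordered = A[np.ix_(perm, perm)]

nnz_dense_first = factor_nnz(A)
nnz_dense_last = factor_nnz(A_reordered)

print("nonzeros in L, dense row first:", nnz_dense_first)
print("nonzeros in L, dense row last: ", nnz_dense_last)
```

Eliminating the dense row first fills in the whole lower triangle, whereas eliminating it last creates no fill-in at all. Sparse Cholesky libraries use fill-reducing orderings to find permutations of the second kind, and this is one source of performance differences between implementations.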

Applications of machine learning methods in environmental sciences

Description: The goal is to apply and compare modern machine learning methods for environmental applications and, potentially, develop novel methods. The specific type of application will be discussed with the supervisor.
Methods: Neural networks, tree-boosting, Gaussian processes, random effects
Knowledge: R or Python
Data: Simulated and real-world data sets that need to be collected for the thesis