[Statlist] Reminder: DACO-FDS Seminar talk with Courtney Paquette and Elliot Paquette, McGill University, Canada - 18 April 2023

Maurer Letizia  letiziamaurer at ethz.ch
Mon Apr 17 08:51:42 CEST 2023


We are pleased to announce and invite you to the following joint talks in our DACO-FDS seminar series:

(Please note that you can choose to attend only the first part; the second part will dive deeper into the technical aspects.)


"DACO-FDS: Stochastic Algorithms in the Large: Batch Size Saturation, Stepsize Criticality, Generalization Performance, and Exact Dynamics (Part I)"
by Courtney Paquette, McGill University, Canada

Date and Time: Tuesday, 18 April 2023, 14.15 - 15.05 (Zurich)
Place: ETH Zurich, HG G 19.1

Abstract: In this talk, we will present a framework for analyzing the dynamics of stochastic optimization algorithms (e.g., stochastic gradient descent (SGD) and SGD with momentum (SGD+M)) when both the number of samples and the number of dimensions are large. For the analysis, we will introduce a stochastic differential equation, called homogenized SGD. We show that homogenized SGD is the high-dimensional equivalent of SGD -- for any quadratic statistic (e.g., population risk with quadratic loss), the statistic under the iterates of SGD converges to the statistic under homogenized SGD when the number of samples n and the number of features d are polynomially related. By analyzing homogenized SGD, we provide exact non-asymptotic high-dimensional expressions for the training dynamics and generalization performance of SGD in terms of a solution of a Volterra integral equation. The analysis is formulated for data matrices and target vectors that satisfy a family of resolvent conditions, which can roughly be viewed as a weak form of delocalization of the sample-side singular vectors of the data. By analyzing these limiting dynamics, we can provide insights into the selection of the learning rate, momentum parameter, and batch size. For instance, we identify a stability measurement, the implicit conditioning ratio (ICR), which regulates the ability of SGD+M to accelerate the algorithm. When the batch size exceeds this ICR, SGD+M converges linearly at a rate of $O(1/\kappa)$, matching optimal full-batch momentum (in particular, performing as well as full-batch momentum but with only a fraction of the batch size). For batch sizes smaller than the ICR, in contrast, SGD+M has rates that scale like a multiple of the single-batch SGD rate. We give explicit choices for the learning rate and momentum parameter, in terms of the Hessian spectra, that achieve this performance. Finally, we show that this model matches performance on real data sets.
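
To make the objects in the abstract concrete, here is a minimal, purely illustrative Python sketch (not the speakers' code or their exact setting) of mini-batch SGD with heavy-ball momentum (SGD+M) on a synthetic least-squares problem, tracking the risk along the iterates. The problem sizes, step size, momentum parameter, and batch size below are arbitrary illustrative choices.

# Illustrative only: mini-batch SGD with heavy-ball momentum (SGD+M) on a
# synthetic least-squares problem, tracking the risk along the iterates.
# Sizes and hyperparameters are hypothetical choices, not values from the talk.
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 500                                # samples and features, both "large"
A = rng.standard_normal((n, d))                 # data matrix
x_star = rng.standard_normal(d) / np.sqrt(d)    # planted signal
b = A @ x_star                                  # targets

gamma, beta, B = 0.05, 0.5, 64                  # step size, momentum, batch size (assumed)
x, v = np.zeros(d), np.zeros(d)
risks = []

for t in range(3000):
    idx = rng.choice(n, size=B, replace=False)       # draw a mini-batch
    grad = A[idx].T @ (A[idx] @ x - b[idx]) / B      # mini-batch gradient of 0.5*mean((Ax-b)^2)
    v = beta * v - gamma * grad                      # heavy-ball momentum step
    x = x + v
    risks.append(0.5 * np.mean((A @ x - b) ** 2))    # risk on the full data

print(f"risk after {len(risks)} steps: {risks[-1]:.3e}")

In the regime described in the abstract, one would study how such risk curves behave as n and d grow together, and how the batch size B compares to the ICR.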

and 


"DACO-FDS: Stochastic Algorithms in the Large: Batch Size Saturation, Stepsize Criticality, Generalization Performance, and Exact Dynamics (Part II)"
by Elliot Paquette, McGill University, Canada

Date and Time: Tuesday, 18 April 2023, 15.10 - 16.00 (Zurich)
Place: ETH Zurich, HG G 19.1

Abstract: Same as Part I above; the two parts share a single abstract, with Part II going deeper into the technical aspects.
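
For orientation before Part II: the "Volterra integral equation" mentioned in the abstract is, generically, an equation of the second kind. A schematic convolution-type form, written here purely for illustration and not as the specific equation derived in the talk, is

$\psi(t) = F(t) + \int_0^t K(t-s)\, \psi(s)\, ds$,

where $\psi(t)$ stands for the risk curve being tracked, and the forcing function $F$ and kernel $K$ are, in the setting of the talk, determined by the data spectrum and the algorithm's hyperparameters (learning rate, momentum parameter, batch size).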


Organisers: A. Bandeira, H. Bölcskei, P. Bühlmann, F. Yang, S. van de Geer, R. Weismantel, R. Zenklusen

DACO seminar website: https://math.ethz.ch/news-and-events/events/research-seminars/daco-seminar.html
FDS seminar website: https://math.ethz.ch/sfs/news-and-events/data-science-seminar.html

