# On Convergence Analysis of Stochastic and Distributed Gradient-Based Algorithms

## Date

2024-05-08

## Authors

Zhang, Mengyao

## Advisor

Zhang, Mengyao

## Publisher

University of Waterloo

## Abstract

Optimization is a fundamental mathematical discipline concerned with finding the best solution from a set of feasible choices. It is vital in applications across engineering, economics, data science, and beyond. Stochastic optimization and distributed optimization are two crucial paradigms in the field. Stochastic optimization deals with uncertainty and variability in problem parameters, providing a framework for decision-making under probabilistic conditions. Distributed optimization, on the other hand, tackles large-scale problems by harnessing the collective power of multiple agents or nodes, each with local information and local communication capabilities. This thesis modifies and analyzes existing stochastic methods and develops algorithms and theory for the unconstrained distributed optimization problem.

For the stochastic adaptive gradient-based methods RMSprop, Adadelta, Adam, AdaGrad, Nadam, and AMSgrad, which are widely used in machine learning and deep learning, Chapter 2 provides a concise and rigorous proof of almost sure convergence to a critical point for smooth, non-convex objective functions. To the best of our knowledge, this work offers the first almost sure convergence rates for these stochastic adaptive gradient-based methods. For non-convex objective functions, we show that a weighted average of the squared gradient norms in each of the aforementioned methods achieves a unified convergence rate of $o(1/t^{\frac{1}{2}-\theta})$ for all $\theta\in\left(0,\frac{1}{2}\right)$. Moreover, for strongly convex objective functions, the convergence rates for RMSprop and Adadelta improve to $o(1/t^{1-\theta})$ for all $\theta\in\left(0,\frac{1}{2}\right)$; these rates are arbitrarily close to the best rates possible.

As a lock-free parallel stochastic gradient descent algorithm, Hogwild! is commonly used for training large-scale machine learning models. Chapter 3 provides an almost sure convergence rate analysis for the Hogwild! algorithm under different assumptions on the loss function. We first prove its almost sure convergence rate for strongly convex functions, which matches the optimal convergence rate of the classic stochastic gradient descent (SGD) method up to an arbitrarily small factor. For non-convex loss functions, we show that a weighted average of the squared gradient norms, as well as the last iterate of the algorithm, converges to zero almost surely. We also provide a last-iterate almost sure convergence rate analysis for this method on general smooth convex functions.

Another aspect of the research addresses the convergence rate analysis of gradient-based distributed optimization algorithms, which achieve computational efficiency and rapid convergence while requiring weaker assumptions. Chapter 4 proposes a novel gradient-based proportional-integral (PI) algorithm and proves that its convergence rate matches that of the centralized gradient descent method under the strong convexity assumption. We then relax this assumption and discuss the local linear convergence of its virtual state for strictly convex cost functions. Chapter 5 proposes the powered proportional-integral (PI) algorithm and proves its finite-time convergence under the assumption of strict convexity; we then discuss the fixed-time convergence of its virtual state for strongly convex cost functions. Finally, we demonstrate the practicality of the distributed algorithms proposed in this thesis through simulation results.
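To make the adaptive gradient-based methods studied in Chapter 2 concrete, here is a minimal sketch of the standard Adam update applied to a toy stochastic objective. The hyperparameter values and the noisy quadratic test function are illustrative assumptions for this demo, not taken from the thesis; the update itself follows the usual bias-corrected two-moment form.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of the standard Adam update with bias-corrected moment estimates."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: exponential average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: exponential average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy stochastic objective: minimize f(x) = x^2 from noisy gradient samples.
rng = np.random.default_rng(0)
x, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 5001):
    g = 2 * x + rng.normal(scale=0.1, size=1)  # noisy gradient of x^2
    x, m, v = adam_step(x, g, m, v, t)
print(float(abs(x[0])))  # the iterate settles near the minimizer x = 0
```

The per-coordinate scaling by $\sqrt{\hat v_t}$ is what distinguishes this family (RMSprop, Adadelta, AdaGrad, Nadam, AMSgrad differ mainly in how the moments are accumulated) from plain SGD.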
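The Hogwild! scheme analyzed in Chapter 3 can be sketched as several threads applying SGD updates to a shared parameter vector with no locking at all. Below is a minimal Python illustration on a toy consistent least-squares problem; the problem data, thread count, and step size are assumptions for the demo, and the original analysis additionally relies on sparse, low-conflict updates.

```python
import threading
import numpy as np

# Toy consistent least-squares problem: minimize ||A w - b||^2 (illustrative data).
rng = np.random.default_rng(1)
A = rng.normal(size=(200, 5))
b = A @ np.arange(5, dtype=float)

w = np.zeros(5)  # shared iterate, updated by all workers without any lock

def worker(seed, steps=2000, lr=0.01):
    r = np.random.default_rng(seed)
    for _ in range(steps):
        i = int(r.integers(len(b)))       # sample one component function
        g = (A[i] @ w - b[i]) * A[i]      # stochastic gradient for row i
        np.subtract(w, lr * g, out=w)     # lock-free in-place update; races are tolerated

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(np.linalg.norm(A @ w - b) / np.linalg.norm(b))  # small relative residual
```

Despite occasional lost or stale updates from the unsynchronized writes, the shared iterate still drives the residual toward zero, which is the behavior the almost sure convergence analysis makes precise.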

## Keywords

Optimization, Distributed Optimization, Stochastic Optimization, Adaptive Methods