Simple Termination Criteria for Stochastic Gradient Descent Algorithm


Date

2021-04-09

Authors

Baghal, Sina


Publisher

University of Waterloo

Abstract

The stochastic gradient descent (SGD) algorithm is widely used in modern mathematical optimization. Because of its scalability and ease of implementation, SGD is usually preferred to other methods, including the gradient descent algorithm, in large-scale optimization. Like other iterative methods, SGD must be used in conjunction with a strategy for terminating the algorithm in order to prevent overfitting. As overfitting is prevalent in supervised machine learning and in noisy optimization problems, developing simple and practical termination criteria is therefore important. This thesis develops simple termination criteria for SGD for two fundamental problems: binary linear classification and least squares deconvolution.

For the binary linear classification problem, we introduce a new and simple termination criterion for SGD applied to binary classification using logistic regression and hinge loss with constant step-size $\alpha>0$. Precisely, we terminate the algorithm once the margin is at least 1:
$$ \text{Terminate when }(2y_{k+1}-1)\zeta_{k+1}^T\theta_k\geq 1, $$
where $\theta_k$ is the current SGD iterate and $(\zeta_{k+1},y_{k+1})$ is the data point sampled at the next SGD iteration. Notably, our proposed criterion adds no additional computational cost to the SGD algorithm. We analyze the behavior of the classifier at termination when each class is sampled from a normal distribution with unknown mean $\mu_0$ or $\mu_1\in \mathbb{R}^d$ and covariance $\sigma^2I_d$, where $\sigma>0$ and $I_d$ is the $d \times d$ identity matrix. In particular, we make no assumptions on the separability of the data set. When the variance is not too large, we obtain the following results: \begin{enumerate} \item The test is activated for any fixed positive step-size. In particular, we establish an upper bound on the expected number of iterations before activation occurs. This upper bound tends to a numeric constant as $\sigma$ converges to zero. In fact, we show that the expected time until termination decreases linearly as the data becomes more separable (\textit{i.e.}, as the noise $\sigma \to 0$). \item We prove that the accuracy of the classifier at termination nearly matches the accuracy of an optimal classifier. Accuracy is the fraction of predictions that a classification model gets right, while an optimal classifier minimizes the probability of misclassification when the sample is drawn from the same distribution as the training data. \end{enumerate} When the variance is large, we show that the test is activated for a sufficiently small step-size. Finally, we empirically evaluate the performance of our termination criterion against a baseline competitor, comparing performance on both synthetic data (Gaussian and heavy-tailed $t$-distribution) and real data sets (MNIST and CIFAR-10). In our experiments, we observe that our test yields relatively accurate classifiers with small variation across multiple runs.
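To make the criterion concrete, the following Python sketch (illustrative only, not taken from the thesis) runs constant step-size SGD for logistic regression on the two-Gaussian model described above and stops as soon as the margin test fires. The streaming data model, the equal class probabilities, and the "test the freshly sampled point, then update" ordering are assumptions of the sketch.

```python
import numpy as np

def sgd_margin_termination(mu0, mu1, sigma, alpha=0.1, max_iter=100_000, seed=0):
    """Constant step-size SGD for logistic regression on a two-Gaussian model,
    stopped by the margin test (2y - 1) * zeta^T theta >= 1 (sketch only)."""
    rng = np.random.default_rng(seed)
    d = mu0.shape[0]
    theta = np.zeros(d)                        # initial iterate theta_0
    for k in range(max_iter):
        # Draw the next data point (zeta_{k+1}, y_{k+1}); classes assumed equally likely.
        y = rng.integers(0, 2)
        zeta = (mu1 if y == 1 else mu0) + sigma * rng.standard_normal(d)
        if (2 * y - 1) * (zeta @ theta) >= 1:  # termination test on the fresh sample
            return theta, k
        # Logistic-loss SGD update with constant step size alpha.
        p = 1.0 / (1.0 + np.exp(-(zeta @ theta)))
        theta = theta - alpha * (p - y) * zeta
    return theta, max_iter                     # budget exhausted without activation

# Example: well-separated classes in R^10.
theta_hat, steps = sgd_margin_termination(np.zeros(10), np.ones(10), sigma=0.5)
```

Because the point used in the test is the same point that would drive the next update, the check reuses the inner product $\zeta_{k+1}^T\theta_k$ that the logistic-gradient step needs anyway, which is why the criterion adds no additional cost per iteration.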
Termination criteria for SGD applied to the least squares deconvolution problem have not been studied in the previous literature. In this thesis, we study the SGD algorithm with a fixed step size $\alpha$ applied to the least squares deconvolution problem, in the setting where the blurred image is contaminated with Gaussian white noise. Under this model, we first establish a novel concentration inequality which shows that, for a small enough step size $\alpha$, the SGD path follows the gradient flow trajectory with overwhelming probability. Inspired by numerical observations, we propose a new termination criterion for SGD for least squares deconvolution. As a first step toward theoretical guarantees for this criterion, we provide an upper bound on the $\ell_2$-error of the iterate at termination when the gradient descent algorithm is used. A full analysis of our termination criterion is left to future work.
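For the deconvolution setting, a minimal sketch of the model appears below; it is illustrative only. The objective is taken to be $\tfrac{1}{2}\|Ax-b\|^2$ with $A$ a blur operator and $b$ the blurred image plus Gaussian white noise, and sampling one measurement per iteration is an assumed stochastic scheme; the thesis's termination criterion itself is not specified in the abstract and is therefore not reproduced here.

```python
import numpy as np

def sgd_deconvolution(A, b, alpha=1e-4, n_iter=200_000, seed=0):
    """Fixed step-size SGD for the least squares deconvolution objective
    (1/2) * ||A x - b||^2, where b = (blur of true image) + Gaussian white noise.
    Sampling one measurement per iteration is an assumed scheme for this sketch."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(n_iter):
        i = rng.integers(m)                    # sample one measurement index
        residual = A[i] @ x - b[i]             # scalar residual for row i
        x = x - alpha * m * residual * A[i]    # unbiased estimate of the full gradient
    return x
```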

Keywords

stochastic gradient descent, binary linear classification, least squares deconvolution, termination criterion
