dc.contributor.author Baghal, Sina dc.date.accessioned 2021-04-09 13:13:43 (GMT) dc.date.available 2021-04-09 13:13:43 (GMT) dc.date.issued 2021-04-09 dc.date.submitted 2021-04-06 dc.identifier.uri http://hdl.handle.net/10012/16872 dc.description.abstract Stochastic gradient descent (SGD) algorithm is widely used in modern mathematical optimization. Because of its scalability and ease of implementation, SGD is usually preferred to other methods including the gradient descent algorithm in the large scale optimization. Similar to other iterative methods, SGD also needs to be employed in conjunction with a strategy to terminate the algorithm in order to prevent a phenomenon called overfitting. As overfitting is prevalent in supervised machine learning and noisy optimization problems, developing simple and practical termination criteria is therefor important. This thesis focuses on developing simple termination criteria for SGD for two fundamental problems: binary linear classification and least squares deconvolution. en In the binary linear classification problem, we introduce a new and simple termination criterion for SGD applied to binary classification using logistic regression and hinge loss with constant step-size $\alpha>0$. Precisely, we terminate the algorithm once the margin is at least to 1. Namely, $$\text{Terminate when }(2y_{k+1}-1)\zeta_{k+1}^T\theta_k\geq 1$$ where $\theta_k$ is the current iterate of SGD and $(\zeta_{k+1},y_{k+1})$ is the sampled data point at the next iteration of SGD. Notably, our proposed criterion adds no additional computational cost to the SGD algorithm. We analyze the behavior of the classifier at termination, where we sample from a normal distribution with unknown means $\mu_0,\mu_1\in \mathbb{R}^d$ and variances $\sigma^2I_d$. Here $\sigma>0$ and $I_d$ is the $d \times d$ identity matrix. As such, we make no assumptions on the separability of the data set. When the variance is not too large, we have the following results: \begin{enumerate} \item The test will be activated for any fixed positive step-size. In particular, we establish an upper bound for the expected number of iterations before the activation occurs. This upper bound tends to a numeric constant when $\sigma$ converges to zero. In fact, we show that the expected time until termination decreases linearly as the data becomes more separable (\textit{i.e.}, as the noise $\sigma \to 0$). \item We prove that the accuracy of the classifier at termination nearly matches the accuracy of an optimal classifier. Accuracy is the fraction of predictions that a classification model got right while an optimal classifier minimizes the probability of misclassification when the sample is drawn from the same distribution as the training data. \end{enumerate} When the variance is large, we show that the test will be activated for a sufficiently small step-size. Finally, we empirically evaluate the performance of our termination criterion versus a baseline competitor. We compare performances on both synthetic (Gaussian and heavy-tailed $t$-distribution) as well as real data sets (MNIST and CIFAR-10 ). In our experiments, we observe that our test yields relatively accurate classifiers with small variation across multiple runs. The termination criteria for SGD for the least squares deconvolution problem has not been studied in the previous literature. In this thesis, we study the SGD algorithm with a fixed step size $\alpha$ applied to the least square deconvolution problem. We adopt the setting wherein the blurred image is contaminated with a Gaussian white noise. Under this model, we first demonstrate a novel concentration inequality which shows that for small enough step size $\alpha$, the SGD path should follow the gradient flow trajectory with overwhelming probability. Inspired by numerical observation, we propose a new termination criterion for SGD for the least squares deconvolution. As a first step towards developing theoretical guarantees for our termination criterion, we provide an upper bound for the $\ell_2$-error term for the iterate at termination when the gradient descent algorithm is considered. We postpone a full analysis of our termination criterion to future work. dc.language.iso en en dc.publisher University of Waterloo en dc.subject stochastic gradient descent en dc.subject binary linear classification en dc.subject least squares deconvolution en dc.subject termination criterion en dc.title Simple Termination Criteria for Stochastic Gradient Descent Algorithm en dc.type Doctoral Thesis en dc.pending false uws-etd.degree.department Combinatorics and Optimization en uws-etd.degree.discipline Combinatorics and Optimization en uws-etd.degree.grantor University of Waterloo en uws-etd.degree Doctor of Philosophy en uws-etd.embargo.terms 0 en uws.contributor.advisor Vavasis, Stephen uws.contributor.affiliation1 Faculty of Mathematics en uws.published.city Waterloo en uws.published.country Canada en uws.published.province Ontario en uws.typeOfResource Text en uws.peerReviewStatus Unreviewed en uws.scholarLevel Graduate en
﻿

### This item appears in the following Collection(s)

UWSpace

University of Waterloo Library
200 University Avenue West