Show simple item record

dc.contributor.author    Baghal, Sina
dc.date.accessioned    2021-04-09 13:13:43 (GMT)
dc.date.available    2021-04-09 13:13:43 (GMT)
dc.date.issued    2021-04-09
dc.date.submitted    2021-04-06
dc.identifier.uri    http://hdl.handle.net/10012/16872
dc.description.abstract    The stochastic gradient descent (SGD) algorithm is widely used in modern mathematical optimization. Because of its scalability and ease of implementation, SGD is usually preferred to other methods, including the gradient descent algorithm, for large-scale optimization. Like other iterative methods, SGD must be paired with a strategy for terminating the algorithm in order to prevent a phenomenon called overfitting. Since overfitting is prevalent in supervised machine learning and in noisy optimization problems, developing simple and practical termination criteria is therefore important. This thesis develops simple termination criteria for SGD for two fundamental problems: binary linear classification and least squares deconvolution.

For the binary linear classification problem, we introduce a new and simple termination criterion for SGD applied to binary classification using logistic regression and hinge loss with constant step size $\alpha>0$. Precisely, we terminate the algorithm once the margin is at least 1:
$$ \text{Terminate when } (2y_{k+1}-1)\zeta_{k+1}^T\theta_k \geq 1, $$
where $\theta_k$ is the current SGD iterate and $(\zeta_{k+1},y_{k+1})$ is the data point sampled at the next iteration. Notably, the proposed criterion adds no additional computational cost to the SGD algorithm. We analyze the behavior of the classifier at termination when each class $y\in\{0,1\}$ is sampled from a normal distribution with unknown mean $\mu_y\in\mathbb{R}^d$ and covariance $\sigma^2 I_d$, where $\sigma>0$ and $I_d$ is the $d\times d$ identity matrix. As such, we make no assumptions on the separability of the data set. When the variance is not too large, we have the following results:
\begin{enumerate}
\item The test is activated for any fixed positive step size. In particular, we establish an upper bound on the expected number of iterations before activation occurs; this bound tends to a numeric constant as $\sigma$ converges to zero. In fact, we show that the expected time until termination decreases linearly as the data become more separable (\textit{i.e.}, as the noise $\sigma\to 0$).
\item We prove that the accuracy of the classifier at termination nearly matches the accuracy of an optimal classifier. Accuracy is the fraction of predictions that a classification model gets right, while an optimal classifier minimizes the probability of misclassification when the sample is drawn from the same distribution as the training data.
\end{enumerate}
When the variance is large, we show that the test is activated for a sufficiently small step size. Finally, we empirically evaluate the performance of our termination criterion against a baseline competitor, comparing performance on both synthetic data sets (Gaussian and heavy-tailed $t$-distribution) and real data sets (MNIST and CIFAR-10). In our experiments, we observe that our test yields relatively accurate classifiers with small variation across multiple runs.
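For concreteness, the following is a minimal Python sketch of constant step-size SGD on the logistic loss combined with the margin-based termination test above. The data layout, zero initialization, uniform sampling, and default step size are illustrative assumptions rather than details taken from the thesis.

```python
import numpy as np

def sgd_logistic_margin_stop(data, labels, alpha=0.1, max_iter=100_000, rng=None):
    """Constant step-size SGD for logistic regression with labels in {0, 1},
    terminated once the freshly sampled point has margin at least 1 under the
    current iterate, i.e. (2*y - 1) * zeta @ theta >= 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = data.shape
    theta = np.zeros(d)                      # theta_0 = 0 (illustrative choice)

    for k in range(max_iter):
        i = rng.integers(n)                  # sample (zeta_{k+1}, y_{k+1}) uniformly
        zeta, y = data[i], labels[i]
        s = 2.0 * y - 1.0                    # signed label in {-1, +1}

        # Proposed test: it reuses the point sampled for the next step,
        # so it adds no extra computation per iteration.
        if s * zeta.dot(theta) >= 1.0:
            return theta, k

        # SGD step on the logistic loss log(1 + exp(-s * zeta @ theta)).
        theta += alpha * s * zeta / (1.0 + np.exp(s * zeta.dot(theta)))

    return theta, max_iter
```

In the Gaussian model described above, `data` would be drawn class by class from $N(\mu_y, \sigma^2 I_d)$; replacing the logistic gradient step with a hinge-loss subgradient step gives the hinge-loss variant mentioned in the abstract.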
Termination criteria for SGD applied to the least squares deconvolution problem have not been studied in the prior literature. In this thesis, we study the SGD algorithm with a fixed step size $\alpha$ applied to the least squares deconvolution problem, in the setting where the blurred image is contaminated with Gaussian white noise. Under this model, we first establish a novel concentration inequality showing that, for a small enough step size $\alpha$, the SGD path follows the gradient flow trajectory with overwhelming probability. Inspired by numerical observations, we propose a new termination criterion for SGD for least squares deconvolution. As a first step towards theoretical guarantees for this criterion, we provide an upper bound on the $\ell_2$ error of the iterate at termination when the gradient descent algorithm is used. We postpone a full analysis of our termination criterion to future work.
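The deconvolution setting can be sketched along the same lines. The snippet below is a generic illustration only, assuming uniform row sampling of a matrix $A$ that encodes the blur and a flattened noisy image $b$; the thesis' actual sampling scheme and termination rule are not spelled out in this abstract, so a fixed iteration budget stands in for the stopping test.

```python
import numpy as np

def sgd_deconvolution(A, b, alpha=1e-3, num_iter=10_000, rng=None):
    """Fixed step-size SGD on f(x) = (1/2m) * ||A x - b||^2, where A models
    the blur and b is the blurred image corrupted by Gaussian white noise.
    Returns the final iterate and the full SGD path.
    """
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    x = np.zeros(n)
    path = [x.copy()]
    for _ in range(num_iter):
        i = rng.integers(m)                  # sample one row of A uniformly
        residual = A[i].dot(x) - b[i]
        x = x - alpha * residual * A[i]      # unbiased stochastic gradient of f
        path.append(x.copy())
    return x, np.array(path)
```

Recording the path makes it easy to compare the SGD iterates against an approximation of the gradient flow trajectory (e.g. full gradient descent with a very small step size), which is the comparison behind the concentration inequality mentioned above.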
dc.language.iso    en    en
dc.publisher    University of Waterloo    en
dc.subject    stochastic gradient descent    en
dc.subject    binary linear classification    en
dc.subject    least squares deconvolution    en
dc.subject    termination criterion    en
dc.title    Simple Termination Criteria for Stochastic Gradient Descent Algorithm    en
dc.type    Doctoral Thesis    en
dc.pending    false
uws-etd.degree.department    Combinatorics and Optimization    en
uws-etd.degree.discipline    Combinatorics and Optimization    en
uws-etd.degree.grantor    University of Waterloo    en
uws-etd.degree    Doctor of Philosophy    en
uws-etd.embargo.terms    0    en
uws.contributor.advisor    Vavasis, Stephen
uws.contributor.affiliation1    Faculty of Mathematics    en
uws.published.city    Waterloo    en
uws.published.country    Canada    en
uws.published.province    Ontario    en
uws.typeOfResource    Text    en
uws.peerReviewStatus    Unreviewed    en
uws.scholarLevel    Graduate    en

