Efficient Kernel Methods for Statistical Detection

dc.comment.hidden: The thesis will be printed in two-sided. [en]
dc.contributor.author: Su, Wanhua
dc.date.accessioned: 2008-03-28T18:49:25Z
dc.date.available: 2008-03-28T18:49:25Z
dc.date.issued: 2008-03-28T18:49:25Z
dc.date.submitted: 2008-03-20
dc.description.abstract: This research is motivated by a drug discovery problem: the AIDS anti-viral database from the National Cancer Institute (NCI). The objective of the study is to develop effective statistical methods for modeling the relationship between the chemical structure of a compound and its activity against the HIV-1 virus. The resulting structure-activity model can then be used to predict the activity of new compounds and thus help identify active compounds that can serve as drug candidates. Since active compounds are generally rare in a compound library, we treat the drug discovery problem as an application of the so-called statistical detection problem. In a typical statistical detection problem, we have data {Xi, Yi}, where Xi is the predictor vector of the ith observation and Yi ∈ {0,1} is its class label. The objective is to identify class-1 observations, which are extremely rare. Besides drug discovery, other applications of statistical detection include direct marketing and fraud detection. We propose a computationally efficient detection method called LAGO, which stands for "locally adjusted GO estimator". The original idea is inspired by an ancient game known today as "GO". The construction of LAGO consists of two steps. In the first step, we estimate the density of class 1 with an adaptive bandwidth kernel density estimator. The kernel functions are located at, and only at, the class-1 observations. The bandwidth of the kernel function centered at a given class-1 observation is calculated as the average distance between that observation and its K nearest class-0 neighbors. In the second step, we adjust the density estimated in the first step locally according to the density of class 0. It can be shown that the amount of adjustment in the second step is approximately inversely proportional to the bandwidth calculated in the first step.
Application to the NCI data demonstrates that LAGO is superior to methods such as K nearest neighbors and support vector machines. One drawback of the existing LAGO is that it provides only a point estimate of a test point's chance of being class 1, ignoring the uncertainty of the model. In the second part of this thesis, we present a Bayesian framework for LAGO, referred to as BLAGO. This Bayesian approach enables quantification of uncertainty. Non-informative priors are adopted. The posterior distribution is calculated over a grid of (K, alpha) pairs by integrating out beta0 and beta1 using the Laplace approximation, where K and alpha are the two parameters used to construct the LAGO score. The parameters beta0 and beta1 are the coefficients of the logistic transformation that converts the LAGO score to the probability scale. BLAGO provides proper probabilistic predictions that have support on (0,1) and also captures the uncertainty of the predictions. By avoiding Markov chain Monte Carlo algorithms and using the Laplace approximation, BLAGO is computationally very efficient. Because it does not require cross-validation, BLAGO is even more computationally efficient than LAGO. [en]
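The two-step LAGO construction described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names `lago_bandwidths` and `lago_score`, the Gaussian product kernel, and the Euclidean distance are all assumptions made for illustration. In particular, the second step is realized here by weighting each kernel by its bandwidth r_i, on the assumption that the class-0 density near a class-1 point is roughly proportional to 1/r_i; the exact form used in the thesis may differ.

```python
import numpy as np

def lago_bandwidths(X1, X0, K):
    """Step 1 bandwidths: for each class-1 point, the average distance
    to its K nearest class-0 neighbors (Euclidean distance assumed)."""
    # Pairwise distances between class-1 and class-0 points, shape (n1, n0).
    dist = np.linalg.norm(X1[:, None, :] - X0[None, :, :], axis=2)
    # The first K columns after partitioning hold the K smallest distances per row.
    nearest = np.partition(dist, K - 1, axis=1)[:, :K]
    return nearest.mean(axis=1)

def lago_score(X_test, X1, X0, K=5, alpha=1.0):
    """Illustrative LAGO-style score for each row of X_test.

    Assumptions (not confirmed by the abstract): a Gaussian product kernel
    with bandwidth alpha * r_i, and a local adjustment that multiplies
    each kernel by r_i.
    """
    r = lago_bandwidths(X1, X0, K)               # shape (n1,)
    h = alpha * r                                # kernel bandwidths
    dim = X1.shape[1]
    diff = X_test[:, None, :] - X1[None, :, :]   # shape (m, n1, dim)
    z2 = ((diff / h[None, :, None]) ** 2).sum(axis=2)
    # Kernels are centered at, and only at, the class-1 observations.
    kern = np.exp(-0.5 * z2) / ((np.sqrt(2.0 * np.pi) * h) ** dim)
    # Assumed step 2: weight each kernel by r_i, then average over class 1.
    return (kern * r[None, :]).mean(axis=1)
```

Test points would then be ranked by `lago_score`, with the highest-scoring ones flagged as likely class-1 candidates. Here `K` and `alpha` play the role of the two tuning parameters named in the abstract, and `K` must not exceed the number of class-0 points.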
dc.identifier.uri: http://hdl.handle.net/10012/3598
dc.language.iso: en [en]
dc.pending: false [en]
dc.publisher: University of Waterloo [en]
dc.subject: statistical detection [en]
dc.subject: Bayesian inference [en]
dc.subject: LAGO [en]
dc.subject: Laplace approximation [en]
dc.subject: support vector machines [en]
dc.subject: k-nearest neighbor [en]
dc.subject.program: Statistics (Biostatistics) [en]
dc.title: Efficient Kernel Methods for Statistical Detection [en]
dc.type: Doctoral Thesis [en]
uws-etd.degree: Doctor of Philosophy [en]
uws-etd.degree.department: Statistics and Actuarial Science [en]
uws.peerReviewStatus: Unreviewed [en]
uws.scholarLevel: Graduate [en]
uws.typeOfResource: Text [en]

Files

Original bundle

Name: wsuthesis.pdf
Size: 3.4 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 241 B
Format: Item-specific license agreed upon to submission