Fault Diagnosis in Enterprise Software Systems Using Discrete Monitoring Data
MetadataShow full item record
Success for many businesses depends on their information software systems. Keeping these systems operational is critical, as failure in these systems is costly. Such systems are in many cases sophisticated, distributed and dynamically composed. To ensure high availability and correct operation, it is essential that failures be detected promptly, their causes diagnosed and remedial actions taken. Although automated recovery approaches exists for specific problem domains, the problem-resolution process is in many cases manual and painstaking. Computer support personnel put a great deal of effort into resolving the reported failures. The growing size and complexity of these systems creates the need to automate this process. The primary focus of our research is on automated fault diagnosis and recovery using discrete monitoring data such as log files and notifications. Our goal is to quickly pinpoint the root-cause of a failure. Our contributions are: Modelling discrete monitoring data for automated analysis, automatically leveraging common symptoms of failures from historic monitoring data using such models to pinpoint faults, and providing a model for decision-making under uncertainty such that appropriate recovery actions are chosen. Failures in such systems are caused by software defects, human error, hardware failures, environmental conditions and malicious behaviour. Our primary focus in this thesis is on software defects and misconfiguration.