Fault Diagnosis in Enterprise Software Systems Using Discrete Monitoring Data
Date
2012-05-18T23:13:14Z
Authors
Reidemeister, Thomas
Publisher
University of Waterloo
Abstract
Success for many businesses depends on their information software systems.
Keeping these systems operational is critical, as failure in these systems is
costly. Such systems are in many cases sophisticated, distributed and
dynamically composed.
To ensure high availability and correct operation, it is essential that
failures be detected promptly, their causes diagnosed and remedial actions
taken. Although automated recovery approaches exist for specific problem
domains, the problem-resolution process is in many cases manual and painstaking.
Computer support personnel put a great deal of effort into resolving the reported
failures. The growing size and complexity of these systems create the need to
automate this process.
The primary focus of our research is on automated fault diagnosis and recovery
using discrete monitoring data such as log files and notifications. Our goal is
to quickly pinpoint the root cause of a failure. Our contributions are:
modelling discrete monitoring data for automated analysis; using such models to
automatically learn common symptoms of failures from historic monitoring data
and so pinpoint faults; and providing a model for decision-making under
uncertainty so that appropriate recovery actions are chosen.
Failures in such systems are caused by software defects, human error, hardware
failures, environmental conditions and malicious behaviour. Our primary focus
in this thesis is on software defects and misconfiguration.