Requirement-based Root Cause Analysis Using Log Data
MetadataShow full item record
Root Cause Analysis for software systems is a challenging diagnostic task due to complexity emanating from the interactions between system components. Furthermore, the sheer size of the logged data makes it often difficult for human operators and administrators to perform problem diagnosis and root cause analysis. The diagnostic task is further complicated by the lack of models that could be used to support the diagnostic process. Traditionally, this diagnostic task is conducted by human experts who create mental models of systems, in order to generate hypotheses and conduct the analysis even in the presence of incomplete logged data. A challenge in this area is to provide the necessary concepts, tools, and techniques for the operators to focus their attention to specific parts of the logged data and ultimately to automate the diagnostic process. The work described in this thesis aims at proposing a framework that includes techniques, formalisms, and algorithms aimed at automating the process of root cause analysis. In particular, this work uses annotated requirement goal models to represent the monitored systems' requirements and runtime behavior. The goal models are used in combination with log data to generate a ranked set of diagnostics that represent the combination of tasks that failed leading to the observed failure. In addition, the framework uses a combination of word-based and topic-based information retrieval techniques to reduce the size of log data by filtering out a subset of log data to facilitate the diagnostic process. The process of log data filtering and reduction is based on goal model annotations and generates a sequence of logical literals that represent the possible systems' observations. A second level of investigation consists of looking for evidence for any malicious (i.e., intentionally caused by a third party) activity leading to task failures. This analysis uses annotated anti-goal models that denote possible actions that can be taken by an external user to threaten a given system task. The framework uses a novel probabilistic approach based on Markov Logic Networks. Our experiments show that our approach improves over existing proposals by handling uncertainty in observations, using natively generated log data, and by providing ranked diagnoses. The proposed framework has been evaluated using a test environment based on commercial off-the-shelf software components, publicly available Java Based ATM machine, and the large publicly available dataset (DARPA 2000).