Adaptive Monitoring of Complex Software Systems using Management Metrics
Munawar, Mohammad Ahmad
MetadataShow full item record
Software systems supporting networked, transaction-oriented services are large and complex; they comprise a multitude of inter-dependent layers and components, and they implement many dynamic optimization mechanisms. In addition, these systems are subject to workload that is hard to predict. These factors make monitoring these systems as well as performing problem determination challenging and costly. In this thesis we tackle these challenges with the goal of lowering the cost and improving the effectiveness of monitoring and problem determination by reducing the dependence on human operators. Specifically, this thesis presents and demonstrates the effectiveness of an efficient, automated monitoring approach which enables detection of errors and failures, and which assists in localizing faults. Software systems expose various types of monitoring data; this thesis focuses on the use of management metrics to monitor a system's health. We devise a system modeling approach which entails modeling stable, statistical correlations among management metrics; these correlations characterize a system's normal behaviour This approach allows a system model to be built automatically and efficiently using the monitoring data alone. In order to control the monitoring overhead, and yet allow a system's health to be assessed reliably, we design an adaptive monitoring approach. This adaptive capability builds on the flexible nature of our system modeling approach, which allows the set of monitored metrics to be altered at runtime. We develop methods to automatically select management metrics to collect at the minimal monitoring level, without any domain knowledge. In addition, we devise an automated fault localization approach, which leverages the ability of the monitoring system to analyze individual metrics. Using a realistic, multi-tier software system, including different applications based on Java Enterprise Edition and industrial-strength products, we evaluate our system modeling approach. We show that stable metric correlations exist in complex software systems and that many of these correlations can be modeled using simple, efficient techniques. We investigate the effect of the collection of management metrics on system performance. We show that the monitoring overhead can be high and thus needs to be controlled. We employ fault injection experiments to evaluate the effectiveness of our adaptive monitoring and fault localization approach. We demonstrate that our approach is cost-effective, has high fault coverage and, in the majority of the cases studied, provides pertinent diagnosis information. The main contribution of this work is to show how to monitor complex software systems and determine problems in them automatically and efficiently. Our solution approach has wide applicability and the techniques we use are simple and yet effective. Our work suggests that the cost of monitoring software systems is not necessarily a function of their complexity, providing hope that the health of increasingly large and complex systems can be tracked with a limited amount of human resources and without sacrificing much system performance.