Online Monitoring of Distributed Systems Using Causal Event Patterns
Event monitoring and logging, that is, recording the communication events between processes, is a critical component in many highly reliable distributed systems. The event logs enable the identification of certain safety-condition violations, such as race conditions and mutual-exclusion violations, as safety is generally contingent on a specific causally ordered pattern of process communication. Previous efforts at finding such patterns have often focused on offline techniques, which are unable to identify operational problems as they occur. Online monitoring tools exist, but they are often restricted to identifying a specific violation condition, such as a deadlock or a race condition, using dedicated data structures. We address the more general problem of detecting causally related event patterns that can be used to identify various undesired behaviours in the system. The main challenge for online pattern matching is the need to store the partial matches to the pattern, as they may combine with future events to form a complete match. Unlike pattern matching in most other domains, causally ordered patterns can span a potentially unbounded number of events, and efficiently searching through this large collection poses a significant challenge. We present an efficient online causal-event-pattern-matching framework that bounds the number of partial matches it stores by reporting only a representative subset of pattern matches. We define a subset of matches as representative if it has at least one occurrence of each event in the pattern on each process, a definition that is applicable to a large class of distributed applications. Our first pattern-matching algorithm, OCEP, introduces a backtracking algorithm to efficiently find a representative subset from the history of events. An evaluation of the framework shows that OCEP is capable of handling several frequently occurring violation patterns at the event rates of some representative distributed applications.
Our second algorithm, Ananke, introduces causality-based rules in the search pattern that can be used to specify the removal of an event from the maintained history. We used some of the most frequently occurring types of concurrency bugs in real-world applications to show that the desired causal order of events can be utilized to specify such removal rules. More importantly, these rules are able to maintain a finite history and still report a representative set of matches within a millisecond in most cases.
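As an illustration of the two notions the abstract relies on, causal ordering between events and a representative subset of matches, the following is a minimal Python sketch. It is not the OCEP or Ananke implementation. It assumes that events carry vector-clock timestamps (the abstract does not say how causality is tracked) and it reads the representativeness definition as: every (pattern event, process) pair that occurs in some match also occurs in the reported subset. All names (`happened_before`, `concurrent`, `coverage`, `is_representative`) are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Assumption: each event carries a vector clock, one counter per process,
# and all clocks have the same length.
VectorClock = Tuple[int, ...]


def happened_before(a: VectorClock, b: VectorClock) -> bool:
    """Return True if the event stamped `a` causally precedes the one stamped `b`."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def concurrent(a: VectorClock, b: VectorClock) -> bool:
    """Return True if neither event causally precedes the other."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)


# Assumption: a match maps each pattern-event name to the id of the
# process on which the matched event occurred.
Match = Dict[str, int]


def coverage(matches: List[Match]) -> Set[Tuple[str, int]]:
    """Collect every (pattern event, process) pair that appears in `matches`."""
    return {(name, proc) for m in matches for name, proc in m.items()}


def is_representative(subset: List[Match], all_matches: List[Match]) -> bool:
    """Check that `subset` covers every (pattern event, process) pair of `all_matches`."""
    return coverage(subset) >= coverage(all_matches)


all_matches = [
    {"send": 0, "recv": 1},
    {"send": 0, "recv": 2},
    {"send": 1, "recv": 2},
]
# Two of the three matches already cover every pair, so the middle one
# would not need to be reported under this reading of the definition.
subset = [all_matches[0], all_matches[2]]
```

Under this reading, the number of matches that must be reported is bounded by the number of pattern events times the number of processes, rather than by the length of the event history.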