Online Trace Reordering for Efficient Representation of Event Partial Orders
Sheikh, Muhammad Bilal
Distributed and parallel applications not only have distributed state but are often inherently non-deterministic, making them significantly more challenging to monitor and debug. A further challenge is the fundamental requirement of determining the order in which the application performs certain actions. A naive approach would be to impose a single total order on all actions, i.e., given any two actions or events, one must happen before the other. Such a global order, however, is often misleading: two events in two different processes may be causally independent even though one occurred before the other. A partial order of events therefore serves as the fundamental data structure for ordering events in distributed and parallel applications. Traditionally, Fidge/Mattern vector timestamps have been used to represent event partial orders. The size of each vector timestamp depends on the number of parallel entities (traces) in the application, e.g., processes or threads. A major limitation of Fidge/Mattern timestamps is that their total size does not scale to large systems with hundreds or thousands of traces. Taylor proposed an efficient offset-based scheme that represents large event partial orders by storing deltas between the timestamps of successive events. Offset-based schemes have been shown to be significantly more space efficient when the traces that communicate the most are placed close to each other for generating the deltas (offsets). In Taylor’s offset-based schemes, the optimal order of traces is computed offline. In this work we adapt the offset-based schemes to reorder traces dynamically and demonstrate that very efficient, scalable representations of event partial orders can be generated in an online setting, requiring as few as 100 bytes/event to store partial order event data for applications with around 1000 processes.
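To make the data structure concrete, the following is a minimal Python sketch of Fidge/Mattern vector timestamps, the causal-order test they support, and a naive delta between successive timestamps. It is an illustration only. All function names are hypothetical, and the sparse delta shown is a simplified stand-in, not a reproduction of Taylor's offset-based scheme or of the online reordering developed in this work.

```python
from typing import Dict, List, Tuple

Timestamp = Tuple[int, ...]


def local_event(clock: List[int], trace: int) -> Timestamp:
    """Record a local (or send) event on `trace` and return its timestamp."""
    clock[trace] += 1
    return tuple(clock)


def receive_event(clock: List[int], trace: int, sender_ts: Timestamp) -> Timestamp:
    """Record a receive event: merge the sender's timestamp, then tick."""
    for i, value in enumerate(sender_ts):
        clock[i] = max(clock[i], value)
    clock[trace] += 1
    return tuple(clock)


def happened_before(a: Timestamp, b: Timestamp) -> bool:
    """True if the event stamped `a` causally precedes the event stamped `b`."""
    return a != b and all(x <= y for x, y in zip(a, b))


def concurrent(a: Timestamp, b: Timestamp) -> bool:
    """True if neither event causally precedes the other."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)


def sparse_delta(prev: Timestamp, curr: Timestamp) -> Dict[int, int]:
    """Illustrative delta (assumption, not Taylor's encoding):
    keep only the entries that changed between two successive timestamps."""
    return {i: c - p for i, (p, c) in enumerate(zip(prev, curr)) if c != p}


# Three traces, each with its own vector clock of length three.
clocks = [[0, 0, 0] for _ in range(3)]

a = local_event(clocks[0], 0)       # send on trace 0
b = local_event(clocks[1], 1)       # independent local event on trace 1
c = receive_event(clocks[1], 1, a)  # trace 1 receives the message sent at a

print(happened_before(a, c))  # a causally precedes c
print(concurrent(a, b))       # a and b are causally independent
print(sparse_delta(b, c))     # entries that changed between b and c on trace 1
```

Each full timestamp carries one entry per trace, which is why the total size grows with the number of traces, whereas the delta stores only the entries that changed between successive events. The sketch does not model trace ordering, which is the aspect this work addresses.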