Sliding Window Query Processing over Data Streams
Database management systems (DBMSs) have been used successfully in traditional business applications that require persistent data storage and an efficient querying mechanism. Typically, it is assumed that the data are static, unless explicitly modified or deleted by a user or application. Database queries are executed when issued and their answers reflect the current state of the data. However, emerging applications, such as sensor networks, real-time Internet traffic analysis, and on-line financial trading, require support for processing of unbounded data streams. The fundamental assumption of a data stream management system (DSMS) is that new data are generated continually, making it infeasible to store a stream in its entirety. At best, a sliding window of recently arrived data may be maintained, meaning that old data must be removed as time goes on. Furthermore, as the contents of the sliding windows evolve over time, it makes sense for users to ask a query once and receive updated answers over time. <br /><br /> This dissertation begins with the observation that the two fundamental requirements of a DSMS are dealing with transient (time-evolving) rather than static data and answering persistent rather than transient queries. One implication of the first requirement is that data maintenance costs have a significant effect on the performance of a DSMS. Additionally, traditional query processing algorithms must be re-engineered for the sliding window model because queries may need to re-process expired data and "undo" previously generated results. The second requirement suggests that a DSMS may execute a large number of persistent queries at the same time, therefore there exist opportunities for resource sharing among similar queries. <br /><br /> The purpose of this dissertation is to develop solutions for efficient query processing over sliding windows by focusing on these two fundamental properties. In terms of the transient nature of streaming data, this dissertation is based upon the following insight. Although the data keep changing over time as the windows slide forward, the changes are not random; on the contrary, the inputs and outputs of a DSMS exhibit patterns in the way the data are inserted and deleted. It will be shown that the knowledge of these patterns leads to an understanding of the semantics of persistent queries, lower window maintenance costs, as well as novel query processing, query optimization, and concurrency control strategies. In the context of the persistent nature of DSMS queries, the insight behind the proposed solution is that various queries may need to be refreshed at different times, therefore synchronizing the refresh schedules of similar queries creates more opportunities for resource sharing.