Autonomous Cooperating Web Crawlers

McLearn, Greg

Files

glmclearn2002.pdf (459.31 KB)

Date

2002

Authors

McLearn, Greg

Publisher

University of Waterloo

Abstract

A web crawler provides an automated way to discover web events: the creation, deletion, or update of web pages. Competition among web crawlers results in redundant crawling, wasted resources, and less-than-timely discovery of such events. This thesis presents a cooperative sharing crawler algorithm and sharing protocol. Without resorting to altruistic practices, competing (yet cooperative) web crawlers can mutually share discovered web events with one another to maintain a more accurate representation of the web than is currently achieved by traditional polling crawlers. The choice to share or merge is entirely up to an individual crawler: sharing occurs when a crawler S allows another crawler M to access its web-event data, and merging occurs when crawler M requests web-event data from crawler S. Crawlers can choose to share with competing crawlers if doing so helps reduce contention between peers for the resources associated with crawling. Crawlers can choose to merge from competing peers if doing so helps them maintain a more accurate representation of the web at less cost than directly polling web pages. Crawlers can control how often they merge through a parameter ρ, which dictates the proportion of time spent polling versus merging with a peer. Depending on certain conditions, pathological behaviour can arise if polling or merging is the only form of data collection. Simulations of communities of simple cooperating web crawlers show that a combination of polling and merging (0 < ρ < 1) can give an individual member of the cooperating community a more accurate representation of the web than a traditional polling crawler achieves. Furthermore, if web crawlers are allowed to evaluate their own performance, they can dynamically switch between periods of polling and merging and still perform better than traditional crawlers.
The mutual performance gain increases as more crawlers are added to the community.
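
The role of ρ described in the abstract can be illustrated with a minimal sketch. This is not the thesis's algorithm or protocol. It assumes, for illustration only, that ρ is the probability of merging on a given step (the abstract does not say whether ρ counts polling or merging), that a web event is a higher version number for a URL, and that a crawler's view is a simple mapping. The names Crawler, choose_action, poll, merge_from, and step are hypothetical.

```python
import random


def choose_action(rho, rng):
    """Pick 'merge' with probability rho, otherwise 'poll'.

    Assumption: rho is the fraction of steps spent merging.
    """
    return "merge" if rng.random() < rho else "poll"


class Crawler:
    """Hypothetical crawler holding a view of the web as url -> version."""

    def __init__(self, name, rho, seed=0):
        self.name = name
        self.rho = rho
        self.rng = random.Random(seed)
        self.view = {}

    def poll(self, web, url):
        # Polling reads the current version of one page directly.
        self.view[url] = web[url]

    def merge_from(self, peer):
        # Merging copies any entry the sharing peer knows that is newer.
        for url, version in peer.view.items():
            if version > self.view.get(url, -1):
                self.view[url] = version

    def step(self, web, peer):
        # One unit of work: either merge from the peer or poll one page.
        action = choose_action(self.rho, self.rng)
        if action == "merge":
            self.merge_from(peer)
        else:
            self.poll(web, self.rng.choice(sorted(web)))
        return action


if __name__ == "__main__":
    web = {"page1": 3, "page2": 7, "page3": 1}
    m = Crawler("M", rho=0.5, seed=1)
    s = Crawler("S", rho=0.5, seed=2)
    for _ in range(10):
        m.step(web, s)
        s.step(web, m)
    print(m.view, s.view)
```

Under these assumptions, ρ = 0 reduces to a traditional polling crawler and ρ = 1 to a crawler that only merges, the two extremes the abstract identifies as prone to pathological behaviour.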