Date of Completion

4-18-2014

Embargo Period

4-18-2014

Keywords

Web robot traffic, Web crawlers, Web crawling, WWW, performance analysis, Web mining, applied machine learning, power-tailed analysis, long-range dependence, statistical characterization, Web user classification

Major Advisor

Swapna Gokhale

Associate Advisor

Lester Lipsky

Associate Advisor

Jun-Hong Cui

Field of Study

Computer Science and Engineering

Degree

Doctor of Philosophy

Open Access

Open Access

Abstract

It has traditionally been believed that humans, who exhibit well-studied behaviors and statistical regularities in their traffic, generate most of the traffic seen by Web servers. Over the past decade, however, the Web has seen a drastic increase in the number of requests initiated by contemporary Web robots or crawlers. These robots, whose traffic can be significant (upwards of 45% on the UConn School of Engineering Web server and 70% across digital libraries), exhibit sophisticated functionality and have widely varying demands. To prepare Web servers to handle this new generation of traffic with high performance, to develop methods that control and limit robot behavior, and to understand how robots interact with the social and sensitive data shared on the Web, a deep understanding of Web robots and their traffic qualities is essential. Unfortunately, the current understanding of robot traffic and its impact on the performance of a Web server is minimal. This deficiency is compounded by the fact that: (i) state-of-the-art methods for identifying robot visits are very limited; and (ii) owing to the fundamental behavioral differences between robots and humans, we cannot assume that our knowledge of human behavior and traffic features carries over to robots.

This dissertation addresses the above deficiencies and carries out a comprehensive evaluation of Web robot traffic on the Internet. First, we introduce and demonstrate the effectiveness of a new approach for detecting robot traffic that is rooted in fundamental differences between robot and human behavior and can run offline or in real time. Second, we propose a multi-dimensional classification scheme to decompose robots based on their functionality, resource favoritism, and workload demands. Third, using traces of requests to Web servers across many Internet domains and a suite of analysis tools, we reveal critical differences in whether robot and human traffic exhibit power-tailed trends and long-range dependence in their arrival processes. Finally, we propose a novel predictive caching algorithm that can service Web robot and human traffic simultaneously and with much higher performance than the caching algorithms used in practice.
