Title

Mining tools for high-dimensional time series data using spectral methods

Date of Completion

January 2007

Keywords

Mathematics|Statistics

Degree

Ph.D.

Abstract

Data mining tools are generally used to extract useful information from large databases. Although this is a relatively young field, several techniques have been developed to handle complex high-dimensional data sets, including clustering and classification, density estimation, trend estimation, structural analysis, and factor analysis. A collaborative effort among statisticians and computer scientists is important for solving many problems involving huge and complex databases, where the complexity increases if there is temporal and/or spatial dependence between variables. We address the problem of developing spectral-domain tools for mining time series data via a clustering algorithm. Such methods are useful in applications to biomedical, marketing, or financial time series.

We discuss the problem of comparing several multivariate time series via their spectral properties. For two independent multivariate Gaussian stationary time series, such a comparison is made via a likelihood ratio test based on the estimated cross-spectra of the series. A simulation-based critical value enables effective comparisons of several such multivariate time series. This extends the maximum periodogram ordinate test developed in the literature for comparing two independent univariate stationary time series. Further, a hierarchical clustering algorithm is developed to compare several multivariate time series, with quasi-distances obtained from the likelihood ratio test statistics derived for pairwise comparisons.

For a comparison of several independent categorical stationary time series, the notion of the spectral envelope (Stoffer, Tyler and McDougall, 1993) is used to first transform the categorical time series to real-valued time series. A distance metric based on the Chernoff divergence (Kakizawa, Shumway and Taniguchi, 1998) is then computed from the spectra of the two real-valued time series to be compared. These pairwise distances are then used in a hierarchical clustering algorithm. We illustrate this method using three different data sets.
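
As a rough illustration of the final clustering step, the Python sketch below estimates spectra for simulated real-valued series, computes a symmetrized Chernoff-type divergence between each pair, and passes the resulting distances to an average-linkage hierarchical clustering. It is not the dissertation's implementation: the simulated AR(1) data, the Welch spectral estimator, the univariate form of the divergence, the choice alpha = 0.5, and all function names are assumptions made for this example.

```python
import numpy as np
from scipy.signal import lfilter, welch
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def simulate_ar1(phi, n, rng):
    """Simulate y[t] = phi * y[t-1] + e[t] with standard normal noise (toy data)."""
    e = rng.standard_normal(n)
    return lfilter([1.0], [1.0, -phi], e)


def estimate_spectrum(x, nperseg=64):
    """Welch spectral estimate (an assumed estimator, not taken from the source)."""
    _, pxx = welch(x, nperseg=nperseg)
    # Drop frequency zero: mean removal forces that ordinate toward 0,
    # which would make the logarithms below unstable.
    return pxx[1:]


def chernoff_distance(f1, f2, alpha=0.5):
    """Symmetrized Chernoff-type divergence between two univariate spectra.

    Assumed univariate form: the frequency average of
    log((alpha*fa + (1-alpha)*fb) / fb) - alpha * log(fa / fb),
    added to the same quantity with fa and fb swapped.
    """
    def one_sided(fa, fb):
        mix = alpha * fa + (1.0 - alpha) * fb
        return 0.5 * np.mean(np.log(mix / fb) - alpha * np.log(fa / fb))

    return one_sided(f1, f2) + one_sided(f2, f1)


rng = np.random.default_rng(0)

# Two groups of three series with very different spectral shapes.
series = [simulate_ar1(0.8, 512, rng) for _ in range(3)]
series += [simulate_ar1(-0.8, 512, rng) for _ in range(3)]
spectra = [estimate_spectrum(x) for x in series]

# Pairwise distance matrix (symmetric, zero diagonal).
k = len(spectra)
dist = np.zeros((k, k))
for i in range(k):
    for j in range(i + 1, k):
        dist[i, j] = dist[j, i] = chernoff_distance(spectra[i], spectra[j])

# Average-linkage hierarchical clustering on the condensed distances,
# cut into two clusters.
tree = linkage(squareform(dist), method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

In this toy setting the two groups have spectra concentrated at opposite ends of the frequency range, so the within-group distances are much smaller than the between-group ones. For categorical series, a scaling step such as the spectral envelope would have to come first; it is not shown here.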