Collaborative Research: Data Mining Meets I/O Performance Evaluation: Advanced Statistical Tools for Analyzing Bursty Traffic

 
Christos Faloutsos Phone: (412)-268.1457
Department of Computer Science Fax : (412)-268.5576
Carnegie Mellon Univ. Email: christos@cs.cmu.edu
Pittsburgh, PA 15213 WWW page: http://www.cs.cmu.edu/~christos

Keywords

Data Mining, Time series, Bursty Traffic.

Project Award Information


Project Summary

The goal of this proposal is to develop and apply statistical tools to analyze bursty time sequences, with emphasis on I/O traffic. The approach has three parts: (a) advanced statistical tools using the ``ARFIMA'' method (b) wavelets and the related ``80-20 law'' and (c) incorporation of these models inside the so-called ``Active Disks'' The results will advance both data mining, as well as disk design: For I/O  and systems design, it will provide new tools to handle bursty traffic, and better buffering and prefetching; for data mining, it will develop new algorithms that model self-similar time sequences (like web and network traffic, in addition to I/O traffic).

This is a joint project with Prof. Anthony Brockwell (Stat./CMU) and Prof. Tara Madhyastha (CS/UCSC).

Goals, Objectives, and Targeted Activities

The goal is to model bursty time sequences, like disk and web traffic, where the traditional Poisson assumption does not hold. Once such a model is found, we will use it for (a) forecasting of upcoming traffic (and hence buffering and prefetching) (b) outlier detection (unexpected traffic patterns may indicate, eg., intrusion attempts) and (c) analysis (since the Poisson model is discredited, the M/G/1 and related models don't apply anymore).  The project is a cross-disciplinary one, merging expertize from statistics (ARIMA models, Fractional Gaussian noise, Hurst exponent), with database expertize (fast similarity search and indexing of time sequences) and with storage systems expertize.

Indication of Success

We have already developed the 'b'-model, which uses the 80-20 law, to generate realistic, bursty traffic [Wang01]. The model needs only one parameter, the 'bias' parameter 'b'; the traffic it generates induces the response times and queue length distributions that match the ones of real disk traffic very well. More over, we have developed the 'pqrs' model which is the first that captures the spatial and temporal locality of disk accesses, as well as the correlation between them. This paper received the 'best student paper' award in 'Performance Evaluation' 2002. Finally, we have developed the 'AWSOM' system which captures arbitrary periodicities in a stream of numerical data.

Project Impact

Project References

The following refereed publications mention the  NSF support, since Sept. 2001:
  1. [Wang01] Mengzhi Wang, Tara Madhyastha, Ngai Hang Chang, Spiros Papadimitriou and Christos Faloutsos, Data Mining Meets Performance Evaluation: Fast Algorithms for Modeling Bursty Traffic, ICDE 2002, San Jose, CA, 2/26/2002 - 3/1/2002.
Refereed citations since September 2002:
  1. Spiros Papadimitriou, Anthony Brockwell and Christos Faloutsos Adaptive, Hands-Off Stream Mining VLDB 2003, Berlin, Germany, Sept. 2003.
  2. Mengzhi Wang, Anastassia Ailamaki and  Christos Faloutsos, Capturing the spatio-temporal behavior of real traffic data Performance 2002 (IFIP Int. Symp. on Computer Performance Modeling, Measurement and Evaluation), Rome, Italy, Sept. 2002
  3. Deepay Chakrabarti and Christos Faloutsos F4: Large-Scale Automated Forecasting using Fractals CIKM 2002, Washington DC, Nov. 2002.

Area Background

The project requires familiarity with time series analysis (ARIMA), as well as with concepts from self-similar/chaotic time sequences (Hurst exponent, variance plot).
 

Area References:

GPRA performance criteria

Discoveries at and across the frontiers of science and engineering: The project straddles many areas: databases (time series similarity search and indexing), storage systems (disk modeling, buffering and prefetching) and statistics (ARIMA, fractional Gaussian noise, Kalman filtering). Its main goal is to use or adapt the best tools from statistics, to make them scalable for huge disk traces, and to also push the envelope in statistics by providing real datasets, where the traditional assumptions of Gaussian noise and i.i.d. do not hold any more.

Connections between discoveries and their use in the service of society: Modeling disk and web traffic is of extreme importance to computer systems and to web servers. We believe that other forms of traffic (network traffic, automobile traffic) as well as time series in general (temperatures over time, outbreaks of diseases etc) may also be bursty, and thus amenable to the same analysis by our tools. Finding patterns in time sequences is the only way to do forecasting and to spot outliers, like intrusion in disk/web/network traffic, denial-of-service attacks, criminal/terrorist activities.