Collaborative Research: Data Mining Meets I/O Performance Evaluation: Advanced
Statistical Tools for Analyzing Bursty Traffic
Keywords
Data Mining, Time series, Bursty Traffic.
Project Award Information
-
Award Number: IIS-0083148
-
Duration: 09/01/01-08/31/04
-
Title: Collaborative Research: Data Mining Meets I/O Performance Evaluation:
Advanced Statistical Tools for Analyzing Bursty Traffic
Project Summary
The goal of this proposal is to develop and apply statistical tools to
analyze bursty time sequences, with emphasis on I/O traffic. The approach
has three parts: (a) advanced statistical tools using the ``ARFIMA'' method
(b) wavelets and the related ``80-20 law'' and (c) incorporation of these
models inside the so-called ``Active Disks'' The results will advance both
data mining, as well as disk design: For I/O and systems design,
it will provide new tools to handle bursty traffic, and better buffering
and prefetching; for data mining, it will develop new algorithms that model
self-similar time sequences (like web and network traffic, in addition
to I/O traffic).
This is a joint project with Prof. Anthony Brockwell (Stat./CMU) and
Prof. Tara Madhyastha (CS/UCSC).
Goals, Objectives, and Targeted Activities
The goal is to model bursty time sequences, like disk and web traffic,
where the traditional Poisson assumption does not hold. Once such a model
is found, we will use it for (a) forecasting of upcoming traffic (and hence
buffering and prefetching) (b) outlier detection (unexpected traffic patterns
may indicate, eg., intrusion attempts) and (c) analysis (since the Poisson
model is discredited, the M/G/1 and related models don't apply anymore).
The project is a cross-disciplinary one, merging expertize from statistics
(ARIMA models, Fractional Gaussian noise, Hurst exponent), with database
expertize (fast similarity search and indexing of time sequences) and with
storage systems expertize.
Indication of Success
We have already developed the 'b'-model, which uses the 80-20 law, to generate
realistic, bursty traffic [Wang01]. The model needs only one parameter,
the 'bias' parameter 'b'; the traffic it generates induces the response
times and queue length distributions that match the ones of real disk traffic
very well. More over, we have developed the 'pqrs' model which is the first
that captures the spatial and temporal locality of disk accesses, as well
as the correlation between them. This paper received the 'best student
paper' award in 'Performance Evaluation' 2002. Finally, we have developed
the 'AWSOM' system which captures arbitrary periodicities in a stream of
numerical data.
Project Impact
-
Human Resources: A Ph.D. candidate, Mr. Spiros Papadimitriou, is
working on the topic; two more Ph.D. students , Mr. Deepay Chakrabarti
and Mrs. Mengzhi Wang, work on aspects of the project in independent study
courses and/or other class projects.
-
Education and curriculum development: Several lectures on time series
analysis are incorporated in a new course, Multimedia databases
and data mining (15-826/10-603) which is a required course in
the CALD Masters program at CMU (CALD = Center for Automated Learning
and Discovery)
Project References
The following refereed publications mention the NSF support, since
Sept. 2001:
-
[Wang01] Mengzhi Wang, Tara Madhyastha, Ngai Hang Chang, Spiros Papadimitriou
and Christos Faloutsos, Data Mining Meets Performance Evaluation: Fast
Algorithms for Modeling Bursty Traffic, ICDE 2002, San Jose, CA, 2/26/2002
- 3/1/2002.
Refereed citations since September 2002:
-
Spiros Papadimitriou, Anthony Brockwell and Christos Faloutsos Adaptive,
Hands-Off Stream Mining VLDB 2003, Berlin, Germany, Sept. 2003.
-
Mengzhi Wang, Anastassia Ailamaki and Christos Faloutsos, Capturing
the spatio-temporal behavior of real traffic data Performance 2002
(IFIP Int. Symp. on Computer Performance Modeling, Measurement and Evaluation),
Rome, Italy, Sept. 2002
-
Deepay Chakrabarti and Christos Faloutsos F4: Large-Scale Automated
Forecasting using Fractals CIKM 2002, Washington DC, Nov. 2002.
Area Background
The project requires familiarity with time series analysis (ARIMA), as
well as with concepts from self-similar/chaotic time sequences (Hurst exponent,
variance plot).
Area References:
-
Jean Beran, Statistics for Long-Memory Processes Chapman and Hall,
1994.
-
Chris Ruemmler and John Wilkes, An introduction to disk drive modeling
IEEE Computer, March 1994.
GPRA performance criteria
Discoveries at and across the frontiers of science and engineering:
The project straddles many areas: databases (time series similarity search
and indexing), storage systems (disk modeling, buffering and prefetching)
and statistics (ARIMA, fractional Gaussian noise, Kalman filtering). Its
main goal is to use or adapt the best tools from statistics, to make them
scalable for huge disk traces, and to also push the envelope in statistics
by providing real datasets, where the traditional assumptions of Gaussian
noise and i.i.d. do not hold any more.
Connections between discoveries and their use in the service of society:
Modeling disk and web traffic is of extreme importance to computer systems
and to web servers. We believe that other forms of traffic (network traffic,
automobile traffic) as well as time series in general (temperatures over
time, outbreaks of diseases etc) may also be bursty, and thus amenable
to the same analysis by our tools. Finding patterns in time sequences is
the only way to do forecasting and to spot outliers, like intrusion in
disk/web/network traffic, denial-of-service attacks, criminal/terrorist
activities.