CNPq: IMiMD - Indexing and Data Mining in Multimedia Databases
Keywords
Data mining, spatial access methods, metric access methods, multimedia
Project Award Information
- Award Number: IIS-9988876
- Duration: 2 years, starting 9/15/2000
- Title: CNPq: IMiMD - Indexing and Data Mining in Multimedia Databases
Project Summary
This project focuses on indexing multimedia data and on developing new
tools to find patterns and correlations in such data. Multimedia objects
can often be mapped to n-dimensional points through feature extraction.
When they cannot, they can still be treated as metric data, provided that
a pairwise distance function is available. Our methods apply to multimedia,
metric and spatial data alike. Typical questions include: "find video clips
similar to a given video clip"; "how strong is the correlation (or
anti-correlation) between the locations of schools and the locations of
libraries?"; "how many schools are within 5 miles of a library?".
This is a joint project with Prof. Caetano Traina from the University
of Sao Paulo, Brazil.
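The question types in the summary can be illustrated with a minimal sketch. It assumes objects have already been mapped to feature vectors, uses a plain linear scan rather than any of the project's access methods, and all function names (`euclidean`, `range_query`, `count_near_any`) are hypothetical. Any function satisfying the metric axioms could be passed in place of `euclidean`.

```python
import math
from typing import Callable, Iterable, List, Sequence, TypeVar

T = TypeVar("T")


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.dist(a, b)


def range_query(data: Iterable[T], query: T, radius: float,
                dist: Callable[[T, T], float]) -> List[T]:
    """Similarity query: every object within `radius` of `query`.

    Linear scan for illustration only; a metric access method would
    prune most of these distance computations.
    """
    return [obj for obj in data if dist(obj, query) <= radius]


def count_near_any(set_a: Iterable[T], set_b: Iterable[T], radius: float,
                   dist: Callable[[T, T], float]) -> int:
    """Count the objects of `set_a` that have some object of `set_b`
    within `radius` (e.g. schools within 5 miles of a library)."""
    set_b = list(set_b)  # allow repeated passes over set_b
    return sum(1 for a in set_a
               if any(dist(a, b) <= radius for b in set_b))


if __name__ == "__main__":
    # Made-up coordinates, in miles, purely for demonstration.
    schools = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
    libraries = [(0.0, 1.0), (20.0, 20.0)]
    print(count_near_any(schools, libraries, 5.0, euclidean))
    print(range_query(schools, (0.0, 0.0), 6.0, euclidean))
```

Because the distance function is a parameter, the same two routines serve both n-dimensional points and purely metric data.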
Goals, Objectives, and Targeted Activities
For indexing, the goals are (a) to provide formulas to estimate the selectivities
for similarity queries and (b) to build faster searching structures.
For data mining, the goals are to provide tools for detection of spatial
correlations and to develop fast visualization algorithms for spatial and
multimedia datasets.
Indication of Success
We have already achieved several of the above goals:
- We have formulas for the analysis of metric trees, and we have developed
Slim-trees, metric access methods that are faster than M-trees [Traina02].
- We also developed the OMNI family of methods, which is the fastest for
metric datasets [Filho01].
- We developed the 'triplot' tool, to find patterns across two sets of points
[Traina01].
- We developed the 'VideoGraph' tool to mine video clips [Pan01].
- We generalized the Zipf distribution with our proposed 'DGX' distribution
[Bi01]. The work was runner-up for the best paper award at KDD 2001.
Long-range results: This is an exploratory project whose
aim is to show that power laws and fractals are the correct tools to use
for data mining and pattern discovery in large spatial and temporal datasets.
This is in contrast to the textbook approaches, which use the uniformity
and independence assumptions, and the Gaussian and Poisson distributions;
although easy to study, these assumptions are clearly unrealistic for an
overwhelming majority of real datasets.
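As a rough illustration of the contrast drawn above, the sketch below counts the pairs of points within distance r for several radii and fits a line in log-log space. If the pair count follows a power law, the slope is its exponent. This is a generic sketch, not the project's algorithm: the function names, the synthetic data and the choice of radii are all assumptions made for the example.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def pair_count(points: Sequence[Point], radius: float) -> int:
    """Number of unordered pairs of points at distance <= radius."""
    n = len(points)
    return sum(1
               for i in range(n)
               for j in range(i + 1, n)
               if math.dist(points[i], points[j]) <= radius)


def loglog_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    den = sum((x - mean_x) ** 2 for x in lx)
    return num / den


def estimate_exponent(points: Sequence[Point],
                      radii: Sequence[float]) -> float:
    """Fitted exponent of the pair-count power law over `radii`."""
    counts: List[int] = [pair_count(points, r) for r in radii]
    if min(counts) == 0:
        raise ValueError("every radius must capture at least one pair")
    return loglog_slope(radii, counts)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, synthetic data only
    radii = [0.02, 0.04, 0.08, 0.16]
    # Points spread over the unit square: the uniformity assumption holds.
    square = [(rng.random(), rng.random()) for _ in range(400)]
    # Points confined to a diagonal line inside the same square.
    line = [(t, t) for t in (rng.random() for _ in range(400))]
    print("square:", estimate_exponent(square, radii))
    print("line:  ", estimate_exponent(line, radii))
```

For the uniformly spread points the fitted exponent should come out near 2, the dimension of the embedding space, while for the points on the line it should be near 1. A uniformity assumption applied to the second dataset would therefore misjudge how fast the pair count grows with r, whereas the fitted exponent adapts to the data.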
Project Impact
- Human Resources: A Ph.D. candidate, Mr. Leejay Wu, is working on
the project. This is a joint project with the Univ. of Sao Paulo, Brazil:
Profs. Caetano and Agma Traina are also involved, as well as their student,
Mr. Roberto Figueira Santos Filho.
- Education and curriculum development: Several lectures on spatial
and metric access methods are incorporated in a new course,
Multimedia databases and data mining (15-826/10-603), which is
a required course in the CALD Masters program at CMU (CALD = Center
for Automated Learning and Discovery) as well as in the newly introduced
Ph.D. program in Computational & Statistical Learning.
Project References
The following refereed publications, which have appeared since March 2001,
acknowledge the NSF support:
- [Filho01] Roberto F. Santos Filho, Agma Traina, Caetano Traina Jr. and
Christos Faloutsos. Similarity search without tears: the OMNI family of
all-purpose access methods. ICDE 2001, Heidelberg, Germany, April 2-6, 2001.
- [Pan01] Jia-Yu Pan and Christos Faloutsos. VideoGraph: A New Tool for Video
Mining and Classification. JCDL'01.
- [Traina01] Agma Traina, Caetano Traina, Spiros Papadimitriou and Christos
Faloutsos. Tri-Plots: Scalable Tools for Multidimensional Data Mining.
KDD 2001, San Francisco, CA, August 2001.
- [Bi01] Zhiqiang Bi, Christos Faloutsos and Flip Korn. The "DGX" Distribution
for Mining Massive, Skewed Data. KDD 2001, San Francisco, CA, August
2001. ("Best Paper Runner-Up" Award.)
- [Traina02] Caetano Traina, Agma Traina, Christos Faloutsos and Bernhard
Seeger. Fast Indexing and Visualization of Metric Datasets using
Slim-trees. IEEE-TKDE, 14, 2, pp. 244-260, March-April 2002.
Area Background
The project requires familiarity with spatial and metric access methods,
as well as with multimedia databases.
Area References:
- Christos Faloutsos. Searching multimedia databases by content. Kluwer
Academic Publishers, Norwell, MA, 1996.
GPRA performance criteria
Discoveries at and across the frontiers of science and engineering:
The project straddles many areas: databases (spatial/metric access methods),
machine vision (e.g., for face and image indexing), and fractals (several
real metric datasets are self-similar, which leads to better analysis).
Connections between discoveries and their use in the service of society:
Retrieval by multimedia content has numerous applications: medical image
retrieval, automatic video processing, scientific databases, to name a
few.