Carnegie Mellon University
15721 Database System Design and Implementation
Spring 2003 - C. Faloutsos
List of suggested projects
Introduction
The projects are grouped according to their general theme. We also list
the data and software available, to leverage your effort. More links and
resources may be added in the future.
Reminders:
Datasets:
Unless explicitly mentioned, the datasets are either 'public' or
'owned' by the instructor; for the rest, we need to discuss 'Non-disclosure
agreements' (NDAs).
Graph data:
-
Graph data: from the movie-actor
database; network topology graphs (e.g., from NLANR);
from the epinions.com trust networks; from citations. Contact the instructor
or Deepay Chakrabarti for cleaned-up copies of these datasets. Goals: find
what these graphs look like; build/adapt tools to visualize them and to
find patterns in them.
-
Social network data, software and
literature
-
Web-log and click-stream data (NDA: needed) - who visited which
page, over time
-
Visit patterns, anonymized, from
Microsoft: for several web pages, and thousands of users, we record
how many times a user visited a specific site. Find patterns, clusters,
fractal dimensions, regularities in the SVD etc.
-
Network traffic data: who sends packets to whom, how much and when.
This is a graph evolving over time.
Time series/ sensor data
-
Sales data from a large retailer (~5Gb/week): For each customer,
we have the items purchased, price, time-stamp and store-id.
-
'Owner' of dataset: Prof. Bill Eddy. Some data are already loaded in IBM/DB2.
-
NDA: will be needed.
-
KURSK
dataset of multiple time sequences: time series from seismological
sensors by the explosion site of the 'Kursk' submarine.
-
Truck traffic data, from our Civil Engineering Department. Number
of trucks, weight etc per day per highway-lane. Find patterns, outliers;
do data cleansing.
-
Sunspots: number of sunspots
per unit time. Some data are here.
Sunspots seem to have an 11-year periodicity, with high spikes. There are
conjectures that the sunspot number of a given year is correlated with
other time sequences, like the width of tree-rings.
-
Time sequences from the Santa Fe
Institute forecasting competition (financial data, laser-beam oscillation
data, patients' apnea data etc)
-
Disk access
traces, from John
Wilkes at HP Labs. For each disk access, we have the timestamp, the
block-id, and the type ('read'/'write'). Here is a snippet
of the data, aggregated per 30'.
-
Contact person: instructor or Mengzhi Wang.
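The 11-year sunspot periodicity mentioned above is the kind of pattern a plain DFT periodogram can expose. The sketch below is only an illustration: the function name `dominant_period` and the synthetic series are assumptions made here, not part of the course datasets or software.

```python
import cmath
import math

def dominant_period(series):
    """Return the period (in samples) of the strongest non-constant DFT component."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]  # remove the constant (k = 0) term
    best_k, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):
        # Direct O(n^2) DFT; fine for short series, use an FFT for long ones.
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k

# Synthetic stand-in for yearly sunspot counts (assumed data): an 11-sample cycle.
synthetic = [50 + 40 * math.sin(2 * math.pi * t / 11) for t in range(110)]
print(dominant_period(synthetic))
```

On real sunspot counts the peak would be broader, since the cycle length is only approximately 11 years.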
Spatial/multimedia data
-
Astrophysics
data - thousands of galaxies, with coordinates, red-shift, spectra,
photographs. Small
snippet of the data. More data are in the 'skyserver' web site, where
you can ask SQL queries
and get data in html or csv format
-
Road segments: several datasets with line segments (roads of U.S.
counties, Montgomery MD, Long Beach CA, etc)
Snippet
of data (roads from California, from TIGER).
-
Several collections of training data from the UC-Irvine
repository and from KDD-nuggets
for machine learning algorithms.
-
Demographic data from the U.S.
Bureau of Census
-
Video/image/sound data, from Informedia.
2Tb of video, segmented; 1M images with features; 10^4 faces - a snippet
of the data is here.
Another dataset has 80,000 images, each described by ~1,000 features. 80
queries and their desired results are also given, by the TREC conference.
-
Contact person: Norm Papernick.
-
Biological
data: images of proteins, with ~50 attributes each.
-
'Owner': Prof. Bob Murphy.
-
Mobile
user data: we have information about mobile machines on the
CMU campus - forecast the demand in each cell; spot abnormalities.
Miscellaneous data
-
FMRI data: neuron activation level, over time, for several voxels
in the brain. Several datasets, for many humans, engaged in different mental
activities.
Available Software
Notes for the software: Before you modify any code, please contact
the instructor - ideally, we would like to use these packages as black
boxes.
-
Readily available:
-
DR-tree : R-tree code; searches for range and nearest-neighbor queries.
In C.
-
B-tree code, for text (should be easily changed to handle numbers, too).
In C.
-
Digital Signal Processing modules for DFT, DCT and DWT. In nawk.
-
Code for SVD in
'Mathematica'.
-
Code for computing the fractal dimension in Perl
and C, by Leejay Wu
-
Code for 'HAMMOCK'
(GUI for several data mining tools)
-
contains fractal dimension estimators, Hurst exponent, 'Approximate Neighborhood
Function' ANF
-
Outside CMU:
-
GiST package from Hellerstein
at UC Berkeley: A general spatial access method, which is easy to customize.
It is already customized to yield R-trees.
-
Pajek: an
award-winning graph analysis tool (MS Windows only, it seems)
-
Graph visualization software: Graphviz etc.; see the site
at Brown University
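To make the notion of 'fractal dimension' used by several of the tools above concrete, here is a minimal box-counting sketch. It is not the Perl/C code by Leejay Wu, and that code may use a different estimator; the function name `box_counting_dimension` and the test point sets are assumptions for illustration only.

```python
import math

def box_counting_dimension(points, scales):
    """Estimate the box-counting dimension of 2-d points in the unit square.

    points: list of (x, y) with 0 <= x, y < 1
    scales: box side lengths r to try (e.g. 1/2, 1/4, ...)
    Returns the least-squares slope of log N(r) versus log(1/r),
    where N(r) is the number of grid boxes of side r that contain a point.
    """
    xs, ys = [], []
    for r in scales:
        occupied = {(int(x / r), int(y / r)) for x, y in points}
        xs.append(math.log(1.0 / r))
        ys.append(math.log(len(occupied)))
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    return sxy / sxx

scales = [1 / 2, 1 / 4, 1 / 8, 1 / 16]
diagonal = [(i / 64, i / 64) for i in range(64)]                   # points on a line
filled = [(i / 64, j / 64) for i in range(64) for j in range(64)]  # points filling a square
print(box_counting_dimension(diagonal, scales))
print(box_counting_dimension(filled, scales))
```

The line should give a slope close to 1 and the filled square a slope close to 2; real datasets typically fall in between, and the slope should be read only over the range of scales where the log-log plot is roughly linear.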
List of Projects
Notation for the markers:
-
[P] : means that the project
has good chances to lead to a publication; [PP]:
means very much so!
-
[G] : means that there is already a
group of people working on it, who could help you out.
-
[D] : may lead to a fancy demo
Spatial Access Methods and indexing
-
[D] Fly-through the universe: Given
information about galaxies (x,y,z, illumination), we have a program
that crudely simulates a spacecraft flying through the universe (a few thousand
galaxies). A similar program, 'galaxy
explorer' by Prof. Szalay works on 130,000 galaxies. We want to scale
up to millions of galaxies, and to take into account the illumination of
each galaxy. The engineering challenge is speed: how to store, retrieve
and display 500 million galaxies. We would also like to show the (available)
telescope image of a galaxy, if we 'fly' close enough to it.
-
[G,P] M-trees/OMNI-trees, and Informedia:
An indexing method for metric datasets [Ciaccia,
Zezula, Patella, VLDB97], and its follow-ups, 'Slim-trees'
and OMNI
trees. Do the analysis of OMNI-trees; devise better algorithms to
choose 'anchors'. Benchmark OMNI trees against the existing SR-trees that
Informedia is using for the TREC dataset. Help CMU win the next video-TREC
competition!
-
[G,P] Buffer
tree: A clever approach to turn main-memory data structures into
disk-based ones. Implement it. Contact person: Dr. Tony (Yufei) Tao.
-
Data mining on the Astrophysics data: recall that we have records
for galaxies ((x,y) coordinates, red shift, type of galaxy, spectrum etc)
- we want to do dimensionality reduction, clustering, to find rules and
to do model fitting. Start from [Faloutsos
et al, SIGMOD2000] and [Traina
et al, SBBD2000]; also check the papers
by Alex Gray.
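For the fly-through project above, the core retrieval step is a range query: fetch the galaxies inside the box currently visible from the spacecraft. The sketch below uses a uniform grid as a deliberately simple stand-in for the R-tree and GiST packages listed earlier; the class name `GridIndex`, its methods and the sample coordinates are all assumptions for illustration.

```python
from collections import defaultdict

class GridIndex:
    """Uniform 3-d grid over points; a simple stand-in for an R-tree."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)  # cell coordinates -> points in that cell

    def _cell(self, point):
        # Floor division keeps negative coordinates in the correct cell.
        return tuple(int(c // self.cell_size) for c in point)

    def insert(self, point):
        self.cells[self._cell(point)].append(point)

    def range_query(self, low, high):
        """Return all points p with low[i] <= p[i] <= high[i] on every axis."""
        lo_cell, hi_cell = self._cell(low), self._cell(high)
        hits = []
        for cx in range(lo_cell[0], hi_cell[0] + 1):
            for cy in range(lo_cell[1], hi_cell[1] + 1):
                for cz in range(lo_cell[2], hi_cell[2] + 1):
                    for p in self.cells.get((cx, cy, cz), ()):
                        if all(l <= c <= h for l, c, h in zip(low, p, high)):
                            hits.append(p)
        return hits

index = GridIndex(cell_size=10.0)
for galaxy in [(1.0, 1.0, 1.0), (5.0, 5.0, 5.0), (25.0, 25.0, 25.0), (-3.0, 2.0, 2.0)]:
    index.insert(galaxy)
print(index.range_query((0.0, 0.0, 0.0), (10.0, 10.0, 10.0)))
```

A grid works well when points are spread evenly; galaxy data are clustered, which is one reason to benchmark against the tree-based methods.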
Sensor/stream data
-
[PP] Finding patterns in moving objects:
find patterns in moving objects (e.g., dolphins in the sea), using Kalman
filtering. Start from the papers of [Tsotras+,
PODS99] and continuations [Arge; Tony Tao]; read up on Kalman filtering.
The goal is to find objects that have similar trajectories. Also in this
project: Dr. Tony Yufei Tao, who is visiting us.
-
[P] Time series and blackouts: Given
many co-evolving time sequences, some of which may temporarily 'black-out',
estimate the black-out values. Again, consider using the powerful 'Kalman
filtering' method.
-
Disk access traffic patterns: given traces from real workstations
(tuples of the form <disk-id, track-id, R/W-flag, timestamp>), find
patterns; do predictions; use them to design better buffering and prefetching
algorithms. Start from the 'pqrs' model of Mengzhi Wang (PEVA'02)
-
Forecasting in mobile user data: we have (anonymized) data from
mobile machines on the CMU campus. Find patterns and outliers, forecast
demand.
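As a starting point for the 'blackouts' project above, the sketch below runs a scalar Kalman filter over one sequence and simply skips the update step whenever a sample is missing. It assumes a local-level (random-walk) model and ignores the co-evolving sequences, which is where the real project would gain its power; the function name `kalman_fill`, the noise variances and the sample readings are assumptions.

```python
def kalman_fill(observations, process_var=0.01, measurement_var=1.0):
    """Filter a scalar series under a local-level (random-walk) model.

    observations: list of floats, with None marking blacked-out samples
    Returns a list of (estimate, variance) pairs, one per time step.
    """
    estimate, variance = 0.0, 1e3  # vague prior: little is known at the start
    results = []
    for z in observations:
        # Predict: the level is assumed to persist, while uncertainty grows.
        variance += process_var
        if z is not None:
            # Update: blend prediction and measurement using the Kalman gain.
            gain = variance / (variance + measurement_var)
            estimate += gain * (z - estimate)
            variance *= (1.0 - gain)
        # During a blackout the prediction alone serves as the estimate.
        results.append((estimate, variance))
    return results

readings = [5.0] * 10 + [None] * 3 + [5.0] * 5
for t, (x, p) in enumerate(kalman_fill(readings)):
    print(t, round(x, 3), round(p, 3))
```

During the blackout (time steps 10 to 12) the estimate stays near the last filtered level while the variance grows, which quantifies how much the filled-in values can be trusted.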
'Hands-off data mining'
-
[D,P,G] Semi-automatic data mining: given
a dataset and some description of it, and a set of data mining tools and
their descriptions, activate the suitable tools and report the results.
The challenge is on the design of a language (probably XML-based), to describe
datasets and tool capabilities. Very ambitious project, but even small
steps are useful. Check the WinMine
toolkit from Microsoft, which seems to focus mainly on Bayesian networks.
The project is closely related to the 'HAMMOCK' project below.
-
[D] 'HAMMOCK':
GUI for computer-aided data mining. HAMMOCK allows users to feed datasets
into selected tools (eg., to compute the fractal dimension etc). Build
more tools for it; improve its interface; add a data cleansing step (see
below).
-
[D] Data cleansing: independent, preprocessing
module of the above project: Design a language and a system to specify
constraints that a given dataset should obey. Start from the Potter's
Wheel paper (Raman + Hellerstein); also check the VACUUM system from
the dissertation
of Rebecca Buchheit (CIT/CMU)
Network and graphs
-
[PP] Large graph data mining: What do
large graphs look like? Do most of them obey the 'six degrees of separation'?
How to pull out a sample of a graph, maintaining its vital properties?
Which sub-graphs are 'unnatural'? Start from the power-law papers by [Barabasi],
[Faloutsos+];
the follow-up work on graph properties [Reittu+]
and generators [Townsley+],
[Jamin+], on theories
about them [Papadimitriou+].
Also working on the project: Deepay Chakrabarti.
-
[P] Traffic matrix: we have measurements
(# of packets, # of bytes), for several source-destination pairs within
the CMU campus network. Study the properties of this matrix (is it skewed?
could we make the independence assumption?), as well as its time evolution.
It can also be viewed as a graph evolving over time. Closely related to
the previous project
-
[D] Large-Graph visualization:
Review the literature; exploit the 'power-law' properties of real graphs,
to provide plots for large graphs ( > O(10**6) nodes and edges). Check
the web site
at Brown University; Pajek; Graphviz; also the sites on social networks
(UCINET, analytictech.com).
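One way to examine the 'six degrees of separation' question above is the neighborhood function: the number of node pairs within h hops, for each h. The sketch below computes it exactly with one BFS per node, which is only feasible for small graphs (the ANF tool listed earlier approximates this quantity for large ones). The function name and the toy path graph are assumptions for illustration.

```python
from collections import deque

def neighborhood_function(adjacency):
    """Count ordered node pairs (u, v), u != v, within h hops, for each h.

    adjacency: dict mapping each node to a list of its neighbors
               (every neighbor must also appear as a key)
    Returns a dict {h: number of ordered pairs at distance <= h}.
    """
    exact = {}  # exact[d] = ordered pairs at distance exactly d
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    exact[dist[v]] = exact.get(dist[v], 0) + 1
                    queue.append(v)
    cumulative, running = {}, 0
    for d in sorted(exact):
        running += exact[d]
        cumulative[d] = running
    return cumulative

# Toy graph (assumed): a path 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(neighborhood_function(path))  # {1: 6, 2: 10, 3: 12}
```

The smallest h at which the count reaches, say, 90% of all reachable pairs gives an 'effective diameter' that can be compared across graphs.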
Miscellaneous - Q-opt
-
[G] Correlations for q-opt: how to
capture, maintain and propagate correlations across numerical/categorical
attributes, to get better estimates for q-opt (of high interest to our
industrial contacts). Start with 2-d histograms; figure out how to estimate
selectivities with a few 2-d histograms, when the selection involves 3
or more attributes, and/or joins. Check the high-end histograms [Poosala+Ioannidis].
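The gap between the independence assumption and a 2-d histogram can be seen on a tiny example. The sketch below uses one bucket per distinct value, which is an assumption made for brevity (real histograms bucket ranges of values); the function name and the perfectly correlated toy table are likewise assumptions.

```python
from collections import Counter

def selectivity_estimates(rows, a_value, b_value):
    """Compare selectivity estimates for the predicate a = a_value AND b = b_value.

    rows: list of (a, b) tuples
    Returns (independence_estimate, joint_estimate).
    """
    n = len(rows)
    hist_a = Counter(a for a, _ in rows)  # 1-d histogram on a
    hist_b = Counter(b for _, b in rows)  # 1-d histogram on b
    hist_ab = Counter(rows)               # 2-d histogram on (a, b)
    # Independence assumption: multiply the two 1-d selectivities.
    independent = (hist_a[a_value] / n) * (hist_b[b_value] / n)
    # 2-d histogram: read the joint frequency directly.
    joint = hist_ab[(a_value, b_value)] / n
    return independent, joint

# Toy table (assumed): 100 rows in which b always equals a.
rows = [(v, v) for v in range(4) for _ in range(25)]
print(selectivity_estimates(rows, 0, 0))  # (0.0625, 0.25)
```

Here the independence estimate is off by a factor of four, which is the kind of error the project aims to reduce when only a few 2-d histograms are available for predicates over three or more attributes.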
C. Faloutsos, Feb. 13, 2003.