Carnegie Mellon University
15-721 Database System Design and Implementation
Spring 2003 - C. Faloutsos

List of suggested projects

Introduction

The projects are grouped according to their general theme. We also list the data and software available, to leverage your effort. More links and resources may be added in the future.

Datasets:

Unless explicitly mentioned, the datasets are either 'public' or 'owned' by the instructor; for the rest, we need to discuss non-disclosure agreements (NDAs).

Graph data
Time series / sensor data
Spatial / multimedia data
Miscellaneous data

Available Software

Notes for the software: Before you modify any code, please contact the instructor - ideally, we would like to use these packages as black boxes.

List of Projects

Notation for the markers:

Spatial Access Methods and indexing

  1. [D] Fly-through the universe: Given information about galaxies (x,y,z, illumination), we have a program that crudely simulates a space-craft flying through the universe (a few thousand galaxies). A similar program, 'galaxy explorer' by Prof. Szalay, works on 130,000 galaxies. We want to scale up to millions of galaxies, and to take into account the illumination of each galaxy. The engineering challenge is speed: how to store, retrieve and display 500 million galaxies. We would also like to show the (available) telescope image of a galaxy, if we 'fly' close enough to it.
  2. [G,P] M-trees/OMNI-trees, and Informedia: An indexing method for metric datasets [Ciaccia, Zezula, Patella, VLDB97], and its follow-ups, 'Slim-trees' and 'OMNI-trees'. Analyze OMNI-trees; devise better algorithms to choose 'anchors'. Benchmark OMNI-trees against the existing SR-trees that Informedia is using for the TREC dataset. Help CMU win the next video-TREC competition!
  3. [G,P] Buffer tree: A clever approach to turn main-memory data structures into disk-based ones. Implement it. Contact person: Dr. Tony (Yufei) Tao.
  4. Data mining on the Astrophysics data: recall that we have records for galaxies ((x,y) coordinates, red shift, type of galaxy, spectrum etc) - we want to do dimensionality reduction, clustering, to find rules and to do model fitting. Start from [Faloutsos et al, SIGMOD2000] and [Traina et al, SBBD2000]; also check the papers by Alex Gray.
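
The 'anchor' idea behind OMNI-trees (project 2) can be sketched in a few lines. The code below is a minimal illustration, not any of the cited implementations: the function name, the toy 1-D metric, and the brute-force structure are my own; a real index would store anchor distances once at build time and organize them on disk.

```python
import random

def omni_range_query(objects, anchors, dist, q, r):
    """Range query: report every object o with dist(q, o) <= r.
    Precomputed anchor distances plus the triangle inequality
    |dist(q,a) - dist(o,a)| <= dist(q,o) prune objects cheaply."""
    # In a real index these are computed once, at build time.
    anchor_dists = [[dist(o, a) for a in anchors] for o in objects]
    q_dists = [dist(q, a) for a in anchors]

    results = []
    for o, o_dists in zip(objects, anchor_dists):
        # If any anchor proves dist(q, o) > r, skip the distance call.
        if any(abs(qd - od) > r for qd, od in zip(q_dists, o_dists)):
            continue
        if dist(q, o) <= r:
            results.append(o)
    return results

# Toy metric space: points on a line, absolute-difference distance.
dist = lambda x, y: abs(x - y)
points = [random.uniform(0.0, 100.0) for _ in range(1000)]
hits = omni_range_query(points, [0.0, 100.0], dist, q=50.0, r=2.0)
assert all(abs(p - 50.0) <= 2.0 for p in hits)
```

Choosing the anchors well (the open problem in project 2) determines how often the pruning test fires before the expensive distance computation.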

Sensor/stream data

  1. [PP] Finding patterns in moving objects: find patterns in moving objects (e.g., dolphins in the sea), using Kalman filtering. Start from the paper by [Tsotras+, PODS99] and its follow-ups [Arge; Tony Tao]; read up on Kalman filtering. The goal is to find objects that have similar trajectories. Also on this project: Dr. Tony (Yufei) Tao, who is visiting us.
  2. [P] Time series and blackouts: Given many co-evolving time sequences, some of which may temporarily 'black out', estimate the black-out values. Again, consider using the powerful 'Kalman filtering' method.
  3. Disk access traffic patterns: given traces from real workstations (tuples of the form <disk-id, track-id, R/W-flag, timestamp>), find patterns; do predictions; use them to design better buffering and prefetching algorithms. Start from the 'pqrs' model of Mengzhi Wang (PEVA'02).
  4. Forecasting in mobile user data: we have (anonymized) data from mobile machines on the CMU campus. Find patterns and outliers, forecast demand.
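
Projects 1 and 2 both lean on Kalman filtering. As a hedged sketch (the function and its noise parameters are my own, not from any cited paper), the simplest possible version for the black-out setting of project 2 is a scalar random-walk Kalman filter that reports its prediction whenever an observation is missing:

```python
def kalman_fill(series, q=1e-2, r=1e-1):
    """Scalar random-walk Kalman filter: state x_t = x_{t-1} + w_t
    (variance q), observation y_t = x_t + v_t (variance r).
    'None' entries (black-outs) receive the predicted value;
    observed entries update the estimate as usual."""
    # Initialize from the first observed value, if any.
    x = next((v for v in series if v is not None), 0.0)
    p = 1.0                     # variance of the state estimate
    filled = []
    for y in series:
        p += q                  # predict (state unchanged under random walk)
        if y is None:
            filled.append(x)    # black-out: report the prediction
            continue
        k = p / (p + r)         # Kalman gain
        x += k * (y - x)        # correct with the observation
        p *= (1.0 - k)
        filled.append(x)
    return filled

# A constant series with a black-out in the middle is recovered exactly.
assert kalman_fill([1.0, 1.0, None, 1.0]) == [1.0, 1.0, 1.0, 1.0]
```

The project proper would use a vector state (one component per co-evolving sequence), so that the surviving sequences help estimate the blacked-out ones.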

'Hands-off data mining'

  1. [DPG] Semi-automatic data mining: given a dataset and some description of it, and a set of data mining tools and their descriptions, activate the suitable tools and report the results. The challenge is the design of a language (probably XML-based) to describe datasets and tool capabilities. A very ambitious project, but even small steps are useful. Check the WinMine toolkit from Microsoft, which seems to focus mainly on Bayesian networks. The project is closely related to the 'HAMMOCK' project below.
  2. [D] 'HAMMOCK': a GUI for computer-aided data mining. HAMMOCK allows users to feed datasets into selected tools (e.g., to compute the fractal dimension). Build more tools for it; improve its interface; add a data-cleansing step (see below).
  3. [D] Data cleansing: an independent, preprocessing module of the above project. Design a language and a system to specify constraints that a given dataset should obey. Start from the Potter's Wheel paper (Raman + Hellerstein); also check the VACUUM system from the dissertation of Rebecca Buchheit (CIT/CMU).
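
One of the tools HAMMOCK exposes is the fractal dimension (project 2). A minimal box-counting sketch, with invented names and no claim of matching HAMMOCK's actual implementation: count occupied grid cells N(s) at several cell widths s, and fit log N(s) against log(1/s); the slope is the dimension estimate.

```python
import math

def box_count_dimension(points, scales):
    """Box-counting estimate of the fractal dimension of a 2-D point
    set: count occupied grid cells N(s) for each cell width s, then
    take the least-squares slope of log N(s) versus log(1/s)."""
    logs = []
    for s in scales:
        occupied = {(int(x // s), int(y // s)) for x, y in points}
        logs.append((math.log(1.0 / s), math.log(len(occupied))))
    n = len(logs)
    mx = sum(x for x, _ in logs) / n
    my = sum(y for _, y in logs) / n
    num = sum((x - mx) * (y - my) for x, y in logs)
    den = sum((x - mx) ** 2 for x, _ in logs)
    return num / den

# Sanity check: points along a diagonal line should give dimension ~1.
line = [(i / 1000.0, i / 1000.0) for i in range(1000)]
assert 0.9 < box_count_dimension(line, [0.2, 0.1, 0.05, 0.025]) < 1.1
```

Real datasets need care in choosing the range of scales (the log-log plot is only linear over the fractal's scaling range).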

Network and graphs

  1. [PP] Large graph data mining: What do large graphs look like? Do most of them obey the 'six degrees of separation'? How can we pull out a sample of a graph while maintaining its vital properties? Which sub-graphs are 'unnatural'? Start from the power-law papers by [Barabasi] and [Faloutsos+]; the follow-up work on graph properties [Reittu+] and generators [Townsley+], [Jamin+]; and theories about them [Papadimitriou+]. Also working on the project: Deepay Chakrabarti.
  2. [P] Traffic matrix: we have measurements (# of packets, # of bytes) for several source-destination pairs within the CMU campus network. Study the properties of this matrix (is it skewed? could we make the independence assumption?), as well as its time evolution. It can also be viewed as a graph evolving over time. Closely related to the previous project.
  3. [D] Large-graph visualization: review the literature; exploit the 'power-law' properties of real graphs to provide plots for large graphs (over 10^6 nodes and edges). Check the web site at Brown University; Pajek; Graphviz; also the sites on social networks (UCINET, analytictech.com).
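
The power-law property that projects 1 and 3 exploit is a statement about the degree distribution: on a log-log plot of (degree, number of nodes with that degree), real graphs tend to follow a straight line. A minimal sketch of computing that plot and its slope (function names are my own; serious exponent estimation would use proper binning or maximum likelihood, not a raw least-squares fit):

```python
import math
from collections import Counter

def degree_distribution(edges):
    """Count node degrees over an undirected edge list, then return
    the histogram: degree -> number of nodes with that degree."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return Counter(deg.values())

def loglog_slope(hist):
    """Least-squares slope of log(count) versus log(degree); for a
    power-law graph this approximates (minus) the power-law exponent."""
    pts = [(math.log(d), math.log(c)) for d, c in hist.items()]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    num = sum((x - mx) * (y - my) for x, y in pts)
    den = sum((x - mx) ** 2 for x, _ in pts)
    return num / den
```

For the sampling question in project 1, this histogram is one of the 'vital properties' a good sample should preserve.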

Miscellaneous - Q-opt

  1. [G] Correlations for q-opt: how to capture, maintain and propagate correlations across numerical/categorical attributes, to get better estimates for query optimization (of high interest to our industrial contacts). Start with 2-d histograms; figure out how to estimate selectivities with a few 2-d histograms, when the selection involves 3 or more attributes, and/or joins. Check the high-end histograms of [Poosala+Ioannidis].
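
As a starting point for the 2-d histogram suggestion, here is a minimal equi-width sketch (function names and the bucket-level predicate interface are my own, and simpler than the [Poosala+Ioannidis] constructions): build bucket counts over two attributes, then estimate a predicate's selectivity by summing the qualifying buckets.

```python
def build_hist2d(rows, xbins, ybins, xlo, xhi, ylo, yhi):
    """Equi-width 2-D histogram: bucket counts over two attributes."""
    wx = (xhi - xlo) / xbins
    wy = (yhi - ylo) / ybins
    hist = [[0] * ybins for _ in range(xbins)]
    for x, y in rows:
        i = min(int((x - xlo) / wx), xbins - 1)
        j = min(int((y - ylo) / wy), ybins - 1)
        hist[i][j] += 1
    return hist

def estimate_selectivity(hist, n, xpred, ypred):
    """Estimated fraction of the n rows falling in buckets whose
    x-index satisfies xpred and whose y-index satisfies ypred."""
    hit = sum(hist[i][j]
              for i in range(len(hist)) if xpred(i)
              for j in range(len(hist[0])) if ypred(j))
    return hit / n

# 100 rows, uniform over a 10x10 grid; 'x < 5' is the low x-bucket.
rows = [(float(x), float(y)) for x in range(10) for y in range(10)]
h = build_hist2d(rows, 2, 2, 0.0, 10.0, 0.0, 10.0)
assert estimate_selectivity(h, 100, lambda i: i == 0, lambda j: True) == 0.5
```

The open question in the project is exactly what this sketch dodges: with 3+ attributes, which pairs deserve a 2-d histogram, and how should the pairwise estimates be combined?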

C. Faloutsos, Feb. 13, 2003.