Carnegie Mellon University
15-415 - Database Applications
Spring 2006

Homework 2 - Due: 2/8, 3:00pm.

For this assignment you will be provided with a sample database containing some relations populated with data. You will be asked to write SQL queries and execute them over the sample relations. You will also be asked to write some of the queries in relational algebra. Please go here for details on what you are to hand in (i.e., deliverables).

This assignment is NOT a group assignment. You are to work individually. You may discuss high-level issues with other students (or TAs), but you may not directly copy work. Please read the entire page before beginning your work.

Please go through the steps explained below to complete this assignment.
  1. DB Setup
  2. Hints for using psql
  3. Formulate queries
  4. Deliverables

DB Setup

What to do? How to do?
Log into the machine that was provided for assignment 0. ssh <user-name>@<machine-name>

Download the file hw2-script.tar.gz into your account. Please cut and paste the COMPLETE command (the URL has been broken across two separate lines). wget http://www.cs.cmu.edu/~natassa/courses/15-415/
current/assignments/hw2-script.tar.gz
Unpack the archive into the directory of your choice. This directory will be referred to as HWDIR henceforth. The archive should expand into a number of files, including loadData.sql. This is the file that contains the SQL statements that will create the sample database you are to use to test your queries. tar -xzvf hw2-script.tar.gz
Set the environment variable PGDATA to point to the directory in which you wish to store your data. mkdir $HOME/data; export PGDATA=$HOME/data
Prepare the data directory for use by the PostgreSQL server.
initdb $PGDATA
Start an instance of PostgreSQL on a port that no one else is using. For this purpose try using the last 4 or 5 digits of your student ID number as a port number. Remember, port numbers have to be between 4000 and 16000. If you find that you are unable to start the PostgreSQL server on that port just try another one until you can. If everything is okay you should see something like:
LOG: checkpoint record is at 0/807ACC
LOG: redo record is at 0/807ACC; undo record is at 0/0; shutdown TRUE
LOG: next transaction id: 487; next oid: 16977
LOG: database system is ready
pg_ctl start -o '-i -p <yourportnum>'
Create a new database called "hw1", that you can use for this assignment. createdb -p <yourportnum> hw1
Connect to your database using the psql client. Please make sure you change your current working directory to HWDIR (i.e., the directory containing the file loadData.sql) cd <HWDIR>; psql -p <yourportnum> hw1
You will receive a welcome message from PostgreSQL ...

Welcome to psql 7.3.3, the PostgreSQL interactive terminal.
Type: \copyright for distribution terms
\h for help with SQL commands
\? for help on internal slash commands
\g or terminate with semicolon to execute query
\q to quit

... followed by a shell-style prompt

hw1=#
 
Load the sample data for this assignment. hw1=# \i loadData.sql


Hints for using psql

Typing \i filename.sql into the psql prompt instructs psql to load the file "filename.sql" and execute all of the SQL commands listed in that file. We recommend that you prepare your queries in a .sql file, to save you the trouble of retyping them. (Anyway, you will need to create an .sql file for your final submission.) You can use your favorite text editor (e.g., vi, emacs, pico, etc.) to edit this file.

Additional tips:

Formulate queries

Now that you have the database setup, its time to fire some queries! We have provided two data sets: (1) one describing technical publications , and (2) one describing a search engine backend.

Dataset 1: The DBLP bibliography database

The DBLP database stores information about various technical content published in conferences, workshops, etc. We have created a toy DBLP database for you, the underlying schema for which is:
authors (author_id INTEGER, author_name VARCHAR (50), author_affiliation VARCHAR (100))
papers (paper_id INTEGER, paper_title VARCHAR (100), category VARCHAR (30))
writes (author_id INTEGER, paper_id INTEGER)
The authors table assigns a unique identifier to each author in the database and stores his name, affiliation.
The papers table assigns a unique identifier to each paper in the database and stores its title and category (category refers to the general technical area to which the paper belongs, e.g., "Databases", "Machine Learning", etc.)
The writes table stores authorship information by associating the author_id of an author with the paper_id of the papers that he has authored (or co-authored.)

You have to write two queries over this dataset. Each query has to be written first as a Relational Algebra expression, and then translated into SQL.

Query 1: Display the names of all authors who have co-authored one or more papers with the author "Christopher Olston"

Query 2: Display the titles of all papers with two or more authors from "Carnegie Mellon University."

Dataset 2: The Doodle search engine

This database is the backend of a search engine called Doodle (which was developed recently by your TAs.) The database stores information about Web pages and provides a search capability on top of these pages. The schema of the database is as follows:
pages (page_id INTEGER, url VARCHAR (300))
sites (site_id INTEGER, name VARCHAR (100))
words (word_id INTEGER, word VARCHAR (30))
belongs (page_id INTEGER, site_id INTEGER)
contains (page_id INTEGER, word_id INTEGER, position INTEGER)
links (links_id INTEGER, linked_id INTEGER)
The pages table assigns a unique identifier to each Web page in the database and stores its complete URL.
The sites table assigns a unique identifier to each Web site in the database and stores its name.
The words table assigns a unique identifier to each word that occurs in the database (i.e., in any of the Web pages), and stores the complete word as a string.
The belongs table associates the identifier of a Web page with the identifier of the Web site to which it belongs.
The contains table associates the identifier of a Web page with the identifiers of each word that it contains, along with the position in the page where the word occurs (1st word at position 1, 2nd word at position 2 and so on).
The linkstable stores hyperlink information. The links_id field refers to the id of the Web page which links to the Web page with id linked_id. For example, a tuple of the form (1003, 429) in the links table says that the Web page with id 1003 links to the Web page with id 429.

You have to write five queries over this dataset. Each query has to be written first as a Relational Algebra expression, and then translated into SQL. For this problem, the queries are example keyword queries that search engine users might issue.

Query 1: Answer the query "superbowl 2006". You have to list the URLs of all Web pages that contain BOTH the words "superbowl" and "2006." Make sure you keep the words in lower case, Doodle isn't so smart as Google!

Query 2: Answer the query "site:www.cnn.com george bush". You have to list the URLS of all Web pages that belong to the site "www.cnn.com" and contain BOTH the words "george" and "bush". Again, make sure you keep both the words in lower case.

Query 3: Answer the query "(brad pitt)~1". You have to list the URLs of all Web pages that contain the words "brad" and "pitt" at consecutive positions (i.e., pages containing the phrase "brad pitt").

Query 4: List the names of all Web sites in which SOME Web page links to a Web page in the site "www.cnn.com".

Query 5: List the names of all Web sites in which EVERY Web page links to a Web page in the site "www.cnn.com".


Deliverables