Carnegie Mellon
University
Information Networking Institute
Scriptor: A Programming by Demonstration (PBD) System for Robust
Information Extraction from Web Sites
TR 2000-04
A Thesis Submitted To The Information
Networking Institute In Partial Fulfillment Of The Requirements For The Degree
Master of Science
in
Information Networking
by
Muhammad Nuri
Pittsburgh, Pennsylvania
May, 2000
1 Introduction 4
1.1 What is Programming by
Demonstration? 5
1.2 Programming by Demonstration for the
Web 6
1.3 Scope of Thesis 7
2 System Model 8
2.1 Trade-off between User Control and
Sophistication of PBD System 8
2.2 Segregating Records from a Web site
by Hints 10
2.3 Repetition of tags as an important
information in segregating records 11
2.4 Rule-based Extraction of Record
Fields 12
2.4.1 Follows <Attribute> 12
2.4.2 Precedes <Attribute> 13
2.4.3 Creating complex rules from basic
follows and precedes rules 13
2.5 Appearance-based extraction of
Record Fields 14
2.6 Sample-based Extraction of Record
Fields 15
3 Example of Use of System on Borders.com 17
3.1 Segregating records by providing
hints 17
3.2 Extracting individual fields of a
record 20
3.2.1 Extracting the title of book by
using appearance-based extraction 20
3.2.2 Extracting Price of Book by using
Rule-based Extraction 21
3.2.3 Extracting Publishing Year of Book
by Sample-based Extraction 22
3.3 Viewing the Results 23
4 Usability Study 24
4.1 Instructions to Subjects 24
4.2 Summarized Data from Experiment 25
4.2.1 Microwarehouse Computer Retailer 25
4.2.2 School of Information Science
Course List-University of Pittsburgh 26
4.2.3 E-bay Auction Site 27
4.2.4 Secapl Financial Site 28
4.2.5 Autotrader User Car Seller 29
4.3 Script Generation Time By Novice
Subjects 30
4.4 Other observations 30
4.5 Conclusions from Usability
Experiment 31
5 Conclusions and Future Work 32
6 Bibliography 33
7 Appendix B: Source Code of PBD System 34
7.1 Organization of Source Code 34
7.2 Core Function Files 34
7.3 Graphical User Interface Files 55
7.4 WebL Files 73
The question of the thesis is to determine whether its feasible to involve a novice user in the design of information agents and the idea is motivated by the need to extract information from a website independent of its underlying structure. These websites are one of a kind.
Information agents gather information autonomously from the Web. The connection between an information agent and the website from which it collects information is “brittle”. Frequent change in the look and structure of the website render these agents useless. These agents use scripts developed by professional programmers. These scripts use the underlying HTML tag structure to get information. With change in look and structure of website, the underlying HTML tag structure changes and the script does not work anymore.
One way around this problem is to hire a professional programmer who would reprogram the script for the information agent. However this approach is cumbersome and has drawbacks. If the programmer tries to modify the existing script, then we have associated errors of copy-paste-modify. If a completely new script is written, the approach is still infeasible that one would have to hire a programmer every time the website changes.
Compounding on the problem above is the need to process information from new websites, which are ever increasing on the Web. As these websites become available, a user would need to set up new information agents for them in helping him/her make a better decision. Thus in the face of these, we propose to involve a user in the design of information agents. When an information agent cannot gather from a website due to change in a website, it would notify its novice user to “reprogram” it. In another scenario, the user sees a new website for which it wants to set up an information agent and “programs” a script for it.
A novice user doesn’t know how to “program” which should be an important assumption. Here in the design of information agents, I want to give a novice user the ability to program. This programming ability cannot be the conventional programming language one learns in a computer science course. Elliot Soloway, director of the Highly Interactive Computing project at the University of Michigan, estimates that even for novices who do take a programming class, less than 1% continue to program when the class ends. How can then a computer be programmed without a programming language? I propose for a Programming-by-Demonstration System.
Programming by demonstration (PBD) or programming by example (PBE) is a powerful end-user-programming paradigm enabling users without formal training in programming to create sophisticated programs. PBD environments create programs for end users by observing and recording software behaviors as they manipulate information on the graphical user interface (GUI) level.
Maulsby concluded from its Turvy experiment that end novice users are content to program agents and want to do so through demonstrations with verbal hints and so on. PBD’s most important characteristic is that everyone can do it. PBD is not much different from or more difficult from using the computer normally. This characteristic also led me to consider PBD as an alternative to “easier” syntactic languages. The “easiness” of syntactic languages has been commented by Canfield etal. They argue that users need not think like computers to accomplish their task but rather every computer system should be simple and intuitive that they can be “programmed” with minimal effort. The system should move to the user rather the opposite happen in present days. The user moves to the system, thinks like a computer in order to accomplish the task. The GUI just makes the process more apparent but does not make it intuitive.
The Web is a great focus for PBD because of its accessibility to a wealth of knowledge, along with the pressing need for helping users organize, retrieve, and browser it all. Recent developments in intelligent agents can help – but only if users are able to communicate their requirements to and control the behavior of their agents. Typically these agents are scripted manually which requires programming skills. With the increasing number of web sites on the Internet, frequent change in the look and structure of web sites, manually programming agents takes valuable time and effort. We felt it necessary that PBD systems can be a flexible solution to the problem. With a PBD system, a novice user could generate sophisticated scripts in a very short time to gather specified information from the Web. An information agent would record the interactions between the user and a conventional direct manipulation interface and writes a program corresponding to user’s actions. The agent can then generalize the program so it works in other situations similar to, but not necessarily exactly the same as, the examples on which it was taught.
Our focus of study remained on sites like Amazon, which give listing of information items. We also attempted to make the PBD system simple and intuitive for a novice user. The following section would explain in detail the PBD system. We tested the system with novice subjects to determine the system’s usability in terms of accuracy and the time taken in script generation. The section on usability discusses the experiment we did with novice subjects.
In this chapter I discuss the model on which the PBD system is based. On a web page from which we wish to extract specific information records, there are two steps involved:
1. Segregate the records in a web page by:
§ Providing hints from the listing of information records, or
§ Using repetition of tags in a web page as an important information.
2. Extract individual fields of record by:
§ Rules-based extraction,
§ Appearance-based extraction, or
§ Sample-based extraction.
Before going further into explaining each of the methods in segregating records from a web page and then identifying individual fields in a record, I consider it important on what considerations the model was based.
The model has to be simplistic enough for a novice user. Here the novice means a regular Internet surfer or a person who usually writes his/her document using a word processor such as Word. We identified that there is a trade-off involved in the design of the system. If finer controls are given to the user then the system use becomes much more complicated. In designing PBD systems for novice users, one of the crucial assumption is to take the system should follow the principles of the novice mind. The novice need not like to think like a computer to program but it should be the other way around. The system should be simple and intuitive for the novice to understand. Since we cannot give a more robust system without increasing the user control of the system then we need to increase the sophistication of the system. Increasing the sophistication of the system essentially means using direct artificial methods, which take more computation time. Also such methods are based mostly on heuristics for the Web. With the changing character of Web, the system would be too brittle. The system model developed below is a result of an iterative exercise keeping the trade-off between user control and sophistication of system in mind. Further, the system was tested against subjects to affirm its accuracy and usability and thus prove that proper intelligence has been built into the system. The chapter on usability shows the results in more detail.
Before information about a single record could be extracted (such as its price of a book, its title etc.), the system needs to know what is a record. That is done by user providing a text from the area in the web page having list of records. The system then proposes candidate record samples to user. The user can accept or ask the system to try for a better one. This goes on until the user is satisfied with sample record. The following figure illustrates the segregating of records by providing hints.




In our PBD system, the web page is modeled as a string containing tags and text. When the user accepts the sample record, the system finds similar records to the appearance of sample record. Thus we then have set of records,
Record Set = S = {[R1], [R2], [R3]…}
Where Ri is the text and tags of individual records.
In majority of the web sites I observed, the record information was enclosing within one enclosing tag and thus I needed to gain only those tags to identify what is in a record. However on a few sites, the information for a single record was split among two tags. For example on Amazon web site, the title and author was in one enclosing tag and the pricing information was in the other enclosing tags. For such web sites, I found the following technique helpful but remember this technique costly in terms of computation and results are not very correct either.
On sites like book sites, auto-sites that display listings, there is an important tag information that can be used to segregate records from the web page. Though the information within tags changes but the tags remain same across all records. This means that if one could remove all the textual information from the web page and let only the tags, then there would be a string of tags, which would be repeating it. This is a heuristic, which would not work in all case. If there are fewer records returned, then one cannot detect such a string of tags. The more records there are from a search on a website, the more likelihood that the tag string would be repeating itself very often. A simple example would illustrate the idea. If the following are the tags left after removing all textual information from a web page,
Web Page Tags = “<a><b><c><a><d><a><d><a><d><a><d>”
Then one can see that tag string “<a><d>” is repeating itself four times and could most likely enclose record information. One can note that a threshold on repetition needs to be defined too. If the threshold were low, many tag strings would come up as possibly the tags enclosing record information. If the threshold is very high to be greater than the number of times the actual enclosing tag string repeats itself, then also one won’t receive any result and there would be no tag string returned. Thus an optimum is required which can be identified and re-identified by the user.
The individual fields are retrieved using the following tools
This tool works like this: Anything that follows the attribute would be picked up as the information. Consider the following simple example,
Books = {[“Java 1.1 $19.95 20%”], [“ ABC of Java $30.09 7%“], [“Programming by Example $49.99 13%”]…}
If price of book is desired then follows $ could be specified. The system would return 19.95, 30.09 and 49.99
![]()


This tool works like this: Anything that precedes the attribute would be picked up as the information. Consider the following simple example,
Books = {[“Java 1.1 $19.95 20%”], [“ ABC of Java $30.09 7%“], [“Programming by Example $49.99 13%”]…}
If discount on book is desired then precedes % could be specified. The system would return 20, 7 and 13.
AND-ing or OR-ing can combine follows and precedes rules to create more complex rules for extraction.
Appearance based extraction is based on the idea that certain fields in a record are formatted differently (e.g. underlined, bold, different color) from other fields in a records. This is valuable in cases where distinguishing attributes like % and $ are not present for fields. For example,
Books = {[“Java 1.1 $19.95 20%”], [“ ABC of Java $30.09 7%“], [“Programming by Example $49.99 13%”]…}
We see here that there are no distinguishing attributes to extract title of book. However, we see that title of every book is underlined. The user specifies ABC of Java and indicates to system. The system would return Java 1.1, ABC of Java and Programming by Example.
We can see from the above simple example that every piece of information was extracted from the set of book records. Figure also explains the concept. When I mention “Advanced Java”, I tell the system that anything that looks this is a Title in a record.

![]()



Sample-based extraction works as follows: The user specifies samples to the system and the system constructs a generic pattern for extraction. In the above example, if the system is given samples 19.95 and 30.09, it would construct a pattern dd.dd where d stands for digit. Anything that matches this pattern would be a price. The system would return 19.95, 30.09 and 49.99 as the price of book.
The following figure illustrates the generalization of publishing year from a set of samples from the web site.


The following illustrates the system Scriptor which is a practical implementation of the system model. The source code is given in the appendix. The system was developed in Java and WebL. WebL is a web query language developed by the Compaq Systems Lab.
You provide the system with a hint and then click on “Click here to see” to see what it proposes is a record.




The system shows a sample to tell you if that is what would be a typical record look like.

If you’re satisfied, you click “Yes this represent a record” and the system advances to the following screen.

You specify to the system that anything that has structural information like Advanced Java ( hyperlink, blue in color) is a title.

You specify $ in the attribute box and tell the system that anything in a record that follows a $ sign is a price of the book. If there are multiple occurrences of $, the first one is picked for information processing.

You specify samples to the system from two or more records from the listing of information records. The system generalizes on these samples.






And one can go on to view other records too as a result of tools applied to the website.
Software was developed to experiment with the usability of the PBD system model. An instructor acquainted the users with the software for 20 minutes. The purpose of instructor was to get them familiar with the software.
All subjects were given the same web pages to extract fields using rule-based, appearance-based or pattern-based extraction. Novice subjects were selected for this experiment. The pages selected were from diverse sites as well as diverse in appearance.
The following summarizes the data gathered from the experiment for different web sites.
|
|
Part |
Description |
Price |
|
Number of Subjects Getting
Correct Extraction |
8 |
8 |
2 |
|
Total Subjects |
8 |
8 |
8 |
|
Percent Correct |
100% |
100% |
25% |

|
|
CRN |
Description |
Prerequisites |
|||
|
Number of Subjects Getting
Correct Extraction |
10 |
10 |
10 |
|||
|
Total Subjects |
10 |
10 |
10 |
|||
|
|
100% |
100% |
100% |
|
|
Item |
Description |
Price |
|
Number of Subjects Getting
Correct Extractions |
9 |
6 |
9 |
|
Total Subjects |
10 |
10 |
10 |
|
Percent Correct |
90% |
60% |
90% |

|
|
Ticker |
Price |
Change |
|
Number of Subjects Getting
Correct Extraction |
9 |
8 |
9 |
|
Total Subjects |