
Topic Labeling of Broadcast News Stories in the Informedia Digital Video Library

Alexander G. Hauptmann and Danny Lee

Department of Computer Science

Carnegie Mellon University

Pittsburgh, PA 15213-3890, USA

Tel: 1-412-268-1448

E-mail: {alex,dlee}@cs.cmu.edu

ABSTRACT

This paper describes the implementation of a topic labeling component for the Informedia Digital Video Library. Each news story recorded from the evening news is assigned topics from a set of 3178 topic categories using a k-nearest neighbor classification algorithm. In preliminary tests, the system achieved a recall of 0.491 and a relevance (precision) of 0.482 when up to 5 topics could be assigned to a news story.

KEYWORDS: Topic detection and labeling, topic spotting and classification, video library, digital libraries, broadcast news story indexing.

INTRODUCTION

The Informedia Digital Library Project [1,2] allows full content indexing and retrieval of text, audio and video material. By integrating technologies from the fields of natural language understanding, image processing, speech recognition and video compression, the Informedia digital video library system allows comprehensive access to multimedia data. News-on-Demand is a particular collection in the Informedia Digital Library that has served as a test-bed for automatic library creation techniques. As of March 1998, the Informedia project had about 1.2 terabytes of news video indexed and accessible online, with 1052 news broadcasts containing 21554 stories.

The Informedia digital video library system has two distinct subsystems: the Library Creation System and the Library Exploration Client. The library creation system runs every night, automatically capturing, processing and adding current news shows to the library. It is during the library creation phase that topics are automatically assigned to incoming news stories. The user can later browse these stories and topics using the library exploration client.

Topics in IDVL

While the original Informedia system allows a search of the full transcript text associated with the audio portion of the video, until now no attempt had been made to classify the news stories into topic categories. Users of the system repeatedly expressed the desire that the large amount of available data be categorized to aid in understanding the corpus and searching it effectively.

Related Research on Topic Detection

The work reported here is similar in spirit to the approach of Schwartz et al. [4], who classify news stories into a static set of topics using a Hidden Markov Model approach and found it to be somewhat better than a naïve Bayesian approach. Yang [7] also reports on other techniques, which try to cluster news stories by similar topic content. Our work differs in that the topic categories are defined a priori and do not change across data sets. We felt this would reflect user needs better than a clustering approach, which could yield different clusters on different days, depending on the contents of the corpus.

DATA

The data for the experiment reported here came from a set of CD-ROMs of broadcast news transcripts published by Primary Source Media [8]. Part of these data was used for training the system, and a separate held-out set was used for the evaluation results reported below. The online Informedia system uses actual broadcast video, for which no manual topic labels are available; however, the data is of the same type as that on the CD-ROMs.

From these CD-ROMs, we used 34671 news stories from 1995 as training data. Each of the news stories had one or more topic labels associated with it. From these topic labels, we selected the 3178 unique topics that occurred at least 10 times in the whole corpus. Topics with fewer instances were viewed as idiosyncratic and ignored in the experiments. To test the accuracy of the topic assignment, we used 11811 news stories from 1996, up to April. A typical story is given in the following paragraph:

"Gossip columnists in Hong Kong have been deprived of juice for the past three days. Hong Kong's Performing Artists Guild decided to boycott reporters and requested that artists stop talking to them. A statement by the artists, published in newspapers on Monday, said the purpose of the boycott was to silently protest the harm done to their dignity and the invasion of their personal freedom. Of course the artists don't seem to mind cooperating with coverage of the boycott. In fact, two dozen stars came out of a meeting on Monday with gags over their mouths. They then posed for pictures. The gossip sheets won't have to suffer too long. Their sources will start talking again at midnight tonight. That's when the boycott ends."

For the above story, human transcribers assigned the following three topic labels: "Performing Arts", "Travel & Leisure", and "China".
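The selection of frequent topics described above can be sketched as follows. This is only an illustrative sketch: the function names and the toy data are hypothetical, and it assumes each story contributes at most one count per topic, which the paper does not state.

```python
from collections import Counter


def select_frequent_topics(story_topics, min_count=10):
    """Return the set of topic labels that occur in at least min_count stories.

    story_topics is a list with one list of topic labels per story.
    Assumption: a topic is counted once per story.
    """
    counts = Counter(topic for topics in story_topics for topic in set(topics))
    return {topic for topic, n in counts.items() if n >= min_count}


def filter_story_topics(story_topics, kept):
    """Drop every label that is not in the kept topic set."""
    return [[t for t in topics if t in kept] for topics in story_topics]


# Toy data (hypothetical), with a lowered threshold so the effect is visible.
stories = [["China", "Performing Arts"], ["China"], ["Travel & Leisure"]]
kept = select_frequent_topics(stories, min_count=2)
print(kept)                                # {'China'}
print(filter_story_topics(stories, kept))  # [['China'], ['China'], []]
```

With the threshold of 10 used in the paper, the same filter would be applied to the labels of all 34671 training stories.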

METHOD

The algorithm for the topic-labeling module was based on a k-nearest neighbor (KNN) strategy [6,7]. The process is split into a training and a classification phase. The training phase only occurs once, but each incoming story document must be classified separately.

During the training phase, the system received as input a set of 34671 broadcast news stories from 1995, which already had manually assigned topics. On average, each news story had 5.48 topics assigned to it. Each news story was preprocessed to remove stop words, and each remaining word was converted into its stemmed root form. The entire set of documents was then indexed using a vector space search engine (SMART) [3]. The weighting scheme used in SMART was "mnc".
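The preprocessing step can be illustrated with a minimal sketch. The stop list and the suffix stripper below are hypothetical stand-ins chosen only to keep the example self-contained; they are not the stop list or stemmer used by SMART.

```python
import re

# Tiny illustrative stop list (assumption, not the list used in the paper).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "on",
              "have", "been", "is", "was"}

# Crude suffix stripping as a stand-in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize on letters, remove stop words, stem the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The artists posed for pictures"))
# ['artist', 'pos', 'pictur']
```

The output shows why a crude stemmer is only a placeholder: stems such as "pos" are usable as index terms but are not linguistic roots.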

During the classification phase, each new, unclassified news story was also preprocessed to remove stop words and convert words into their root stems. The unclassified document was then vectorized into the SMART vector space using the "ltc" weighting scheme. The similarity between the unclassified news story vector and each of the training story vectors was computed using the cosine similarity measure.
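The vectorization and similarity computation can be sketched as below. The sketch assumes the usual reading of the SMART code "ltc" (logarithmic term frequency, inverse document frequency, cosine normalization) and, as a simplification, applies it to both the new story and the training stories, whereas the system described here weights the training stories with "mnc". All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of training documents containing it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def ltc_vector(tokens, df, n_docs):
    """Sparse vector with (1 + ln tf) * ln(N / df) weights, scaled to unit length."""
    weights = {}
    for term, count in Counter(tokens).items():
        if df.get(term, 0) == 0:
            continue  # term never seen in training, so it cannot match anything
        weights[term] = (1.0 + math.log(count)) * math.log(n_docs / df[term])
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return {}
    return {term: w / norm for term, w in weights.items()}


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(term, 0.0) for term, w in u.items())


# Toy training set of preprocessed stories (hypothetical).
train = [["boycott", "artist", "hong", "kong"],
         ["election", "vote"],
         ["artist", "film"]]
df = document_frequencies(train)
train_vectors = [ltc_vector(doc, df, len(train)) for doc in train]
query = ltc_vector(["artist", "boycott"], df, len(train))
similarities = [cosine(query, vec) for vec in train_vectors]
print(similarities)
```

In this toy example the first training story shares two terms with the new story and therefore ranks highest, while the second shares none and scores zero.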

The 10 top-ranked training documents were selected based on their similarity to the unclassified news story. Every topic assigned to one of these top-10 training documents was given the same similarity score as the training document itself. When the same topic occurred in several top-10 stories, the similarity scores were summed for that topic. The final summed score was used as the topic relevance score, indicating the relevance of the topic to the new, previously unclassified document. The top 5 topics with a relevance score above a threshold of 0.8 were selected as the topics to label the new story. These topic labels were then added to the indexed Informedia News-on-Demand database, which allows searching on topics as well as browsing.
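The voting step described above can be sketched as follows, using the parameter values given in the text (10 neighbors, at most 5 topics, threshold 0.8). The function name, the tie-breaking by topic name, and the toy similarity values are assumptions made for illustration.

```python
from collections import defaultdict


def label_story(neighbors, k=10, max_topics=5, threshold=0.8):
    """Assign topics to a new story from its most similar training stories.

    neighbors is an iterable of (similarity, topics) pairs, one per training
    story. Returns up to max_topics (topic, score) pairs, best first.
    """
    top_k = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:k]

    # Each topic inherits the similarity of every top-k story that carries it.
    scores = defaultdict(float)
    for similarity, topics in top_k:
        for topic in set(topics):
            scores[topic] += similarity

    # Rank by summed score; ties are broken alphabetically (an assumption).
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(topic, score) for topic, score in ranked if score > threshold][:max_topics]


# Toy neighbors (hypothetical similarities and labels).
neighbors = [
    (0.75, ["China", "Performing Arts"]),
    (0.5, ["China"]),
    (0.25, ["Performing Arts"]),
    (0.125, ["Travel & Leisure"]),
]
print(label_story(neighbors))
# [('China', 1.25), ('Performing Arts', 1.0)]
```

Because scores are summed rather than averaged, a topic shared by several moderately similar stories can outrank a topic carried by a single very similar story.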

RESULTS

Recall and relevance were measured on an independent test set of 11811 news stories from 1996, which were thus more recent than the training data from 1995. Each story had topics assigned to it by human labelers, of the same type as the topic labels that had been applied to the training data. We compared those manual topic labels with the top 5 topics generated by the KNN method. In other words, we measured how many of the topics generated by the KNN method were the same as the ones assigned by humans (precision, reported here as relevance) and how many of the human-assigned topics were correctly assigned by the KNN method (recall). At 5 topics, the KNN system's recall was 0.491 and its relevance was 0.482. The scoring was literal, and as a result many near misses were scored as errors; e.g., "Travel" would be an incorrect label for the above story, whose manual label is "Travel & Leisure".
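The literal scoring described above can be sketched as follows. The paper does not state how scores were averaged over stories; this sketch assumes micro-averaging (pooling label counts over all test stories), and the function name and toy labels are hypothetical.

```python
def precision_recall(predicted, reference):
    """Micro-averaged precision and recall with exact label matching.

    predicted and reference are parallel lists holding, for each story, the
    system-assigned and the human-assigned topic labels respectively.
    """
    correct = sum(len(set(p) & set(r)) for p, r in zip(predicted, reference))
    n_predicted = sum(len(set(p)) for p in predicted)
    n_reference = sum(len(set(r)) for r in reference)
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_reference if n_reference else 0.0
    return precision, recall


# Toy example: "Travel" is a near miss for "Travel & Leisure" and counts as an error.
predicted = [["China", "Travel"], ["Politics"]]
reference = [["China", "Travel & Leisure", "Performing Arts"], ["Politics"]]
precision, recall = precision_recall(predicted, reference)
print(round(precision, 3), recall)  # 0.667 0.5
```

Exact string matching is what makes near misses count as errors here; a scorer that credited partial matches would report higher figures on the same output.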

CONCLUSIONS

In summary, we found the approach to be promising and the results encouraging. However, there are drawbacks to the use of manual news topics from a limited epoch, which directly reflect the issues current at the time (e.g., the Princess of Wales or O.J. Simpson figured prominently as topic categories at particular times). In the long term, we would like to shift away from the ad hoc set of topics used in the broadcast news transcript CD-ROMs to a carefully defined set of hierarchical categories. Possible candidates are the Dewey Decimal Classification system or the Library of Congress Classification scheme, which are popular in libraries around the world. We would also like to integrate the topic classification tightly into the browsing and navigation component of the Informedia system, instead of merely allowing users to browse or search for these topics.

ACKNOWLEDGMENTS

This paper is based on work supported by the National Science Foundation, DARPA and NASA under NSF Cooperative agreement No. IRI-9411299. Thanks also to Yiming Yang for her insightful discussions.

REFERENCES

[1] Christel, M., Kanade, T., Mauldin, M., Reddy, R., Sirbu, M., Stevens, S., and Wactlar, H., "Informedia Digital Video Library", Communications of the ACM, 38 (4), April 1994, pp. 57-58.

[2] The Informedia Digital Video Library Project, http://www.informedia.cs.cmu.edu/

[3] Salton, G., Ed., "The SMART Retrieval System", Prentice-Hall, Englewood Cliffs, 1971.

[4] Schwartz, R., Imai, T., Kubala, F., Nguyen, L., and Makhoul, J., "A Maximum Likelihood Model for Topic Classification in Broadcast News", Eurospeech-97, 5th European Conference on Speech Communication and Technology, Rhodes, Greece, September 1997.

[5] Thompson, R., Shafer, K., and Vizine-Goetz, D., "Evaluating Dewey Concepts as a Knowledge Base for Automatic Subject Assignment", http://orc.rsch.oclc.org:6109/eval_dc.html

[6] Yang, Y., Carbonell, J. G., Allan, J., and Yamron, J., "Topic Detection and Tracking: Detection-Task", Project report on the TDT Workshop, October 1997.

[7] Yang, Y., "An Evaluation of Statistical Approaches to Text Categorization", Technical Report CMU-CS-97-127, Computer Science Department, Carnegie Mellon University, 1997.

[8] Primary Source Media, Broadcast News CD-ROM, Woodbridge, CT, 1995, 1996.