Multimodal Information Systems and GIS: The Informedia Digital Video Library

*Submitted to the 1999 ESRI User Conference (San Diego, CA, July 27 – July 30, 1999)

Andreas M. Olligschlaeger

Computer Science Dept. and H. John Heinz III School of Public Policy and Management
Carnegie Mellon University
Pittsburgh, PA 15213 USA
+1 412 268 4151
olli@cs.cmu.edu

Alexander G. Hauptmann

Computer Science Dept.
Carnegie Mellon University
Pittsburgh, PA 15213 USA
+1 412 268 1448
hauptmann@cs.cmu.edu

ABSTRACT

The Informedia Digital Video Library currently contains over 2000 hours of video and is growing at a rate of approximately 15 hours per day. The stored video is automatically processed to extract meta data which describes the content of the video in a variety of ways. One recent extension to the video processing is the extraction and geocoding of locations mentioned in the transcript of the video. This paper describes how the video is geocoded and illustrates how geographic context can be queried and mapped from within the Informedia user interface.

Keywords

Digital video library, geographic context extraction, multimedia abstraction, geocoding

INTRODUCTION

The Informedia Digital Video Library project was initiated under the Digital Libraries Initiative (DLI) in 1994 by the National Science Foundation, DARPA and NASA. The goal of the project is to provide full content search and retrieval from digital video, audio and text. In order to accomplish this goal, Informedia employs a variety of techniques derived from a number of disciplines. In particular, speech recognition, image processing and natural language understanding techniques are used to automatically extract meta data from video (Christel and Olligschlaeger, 1999). The term "meta data" in this paper refers to descriptive information about video.

Since 1994 the project has been digitizing video from a variety of sources such as CNN news, NASA, the Discovery Channel and United States government agencies (Christel, 1999). The video library currently contains over 1.5 terabytes of data and meta data, representing over 2000 hours of video.

Anyone familiar with web search engines can appreciate how difficult it can be to retrieve information from vast amounts of data. The same is true for Informedia. Although news video is segmented into separate stories, simple text-based queries for particular topics can result in the retrieval of hundreds of news segments. Naturally, as the video library grows, so does the number of query results.

The Informedia user interface therefore incorporates a number of visualization techniques that allow the user to quickly narrow a search to a smaller set of news segments without having to traverse a list of results, as one normally would with a web search engine (Christel and Olligschlaeger, 1999). One such visualization technique is the "Visualization by Example" (VIBE) interface due to Olsen et al (1993). VIBE allows the user to more narrowly define the list of candidates based on the length of the segment, the date, as well as a measure of relevance to the original query (Ahlberg and Shneiderman, 1994).

Maps are a recent addition to the set of query and visualization techniques in Informedia. News broadcasts, documentaries and other video media contain numerous references to geographic locations. Adding geographic context to the meta data in Informedia not only allows the user to visualize the geographic extent of a news segment, but also permits the user to perform map based queries of video contained in the Informedia library. In addition, maps in the Informedia client application are interactive: as video segments are played back, places are highlighted on the map when they are mentioned in the video.

The remainder of this paper describes some of the challenges encountered in geocoding video, as well as the current process used to extract geographic coordinates.

EXTRACTING META DATA FROM VIDEO

The meta data extraction process in Informedia begins by retrieving the transcript of the video and aligning each word with the time (in milliseconds from the beginning of the video) at which it is spoken (see Wactlar et al, 1999, for more details). Transcripts are obtained in one of two ways: either by capturing closed captioned text, if available, or via speech recognition. Informedia currently uses Carnegie Mellon University's Sphinx III system. As with most speech recognition systems, its accuracy improves with the amount of time devoted to processing. For example, a processing effort of 30 times real time on evening news broadcasts results in a word error rate of approximately 35% (including insertions, deletions and substitutions), whereas an effort of 300 times real time yields a word error rate of about 24% (Wactlar et al, 1999). The video transcript is the primary source of information for the geocoding algorithm.
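
As a concrete illustration, the following minimal sketch shows one way the word/time alignment described above might be represented. The class and function names are our own placeholders, not part of the Informedia code base.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class AlignedWord:
        word: str       # transcript token, e.g. "JERUSALEM"
        start_ms: int   # offset from the beginning of the video
        end_ms: int

    def words_in_window(transcript: List[AlignedWord],
                        start_ms: int, end_ms: int) -> List[AlignedWord]:
        """Return the transcript words spoken within a given time window."""
        return [w for w in transcript if start_ms <= w.start_ms < end_ms]

    # Example: the words spoken during the first ten seconds of a segment.
    segment = [AlignedWord("PRESIDENT", 0, 420),
               AlignedWord("CLINTON", 430, 900),
               AlignedWord("JERUSALEM", 11200, 11850)]
    print([w.word for w in words_in_window(segment, 0, 10000)])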

Next, the video is segmented into stories. For example, a one-hour news broadcast on CNN contains many different stories on various topics. On average, each video in the Informedia library consists of about 40 different segments.

For geocoding purposes, one additional source of information is used. Often, the location where a news story was filmed is not actually mentioned in the story. Instead, a line of text superimposed on the video may indicate the name of the correspondent, as well as the location from which he/she is reporting, such as "Ted Koppel, Jerusalem". In order to capture this potentially useful information, video optical character recognition (VOCR; see Sato et al, 1998) is used to scan frames within the video for possible text, and then to extract the text and include it in the meta data. As with the transcript, text derived via VOCR is synchronized with the video.

GEOCODING CHALLENGES

Traditional address matching typically involves finding a match within an address coverage based on a street number, prefix, street name, suffix, and street type. This information is usually contained in a table with well defined columns. Street address matching is a fairly straightforward process that has been thoroughly researched and documented. A number of utilities (such as Arc/Info's Addressparse command) exist to ensure that an address is in the correct format so that it can be matched to an address coverage. Typical match rates range from 75% to 80%, depending on the quality of the address data to be matched and the completeness of the address coverage, and can be as high as 98% (Olligschlaeger, 1997).

Extracting and geocoding locations from free form text such as video transcripts poses a number of challenges that are not normally encountered when matching street addresses. In Informedia, not only are there multiple sources of information that can be used for extracting geographic context (speech recognized transcripts, closed captioning, VOCR), but each source is in a different format. For example, transcripts derived from closed captioning tend to be relatively error free and include punctuation. Speech recognized transcripts, on the other hand, are only about 75% accurate and contain no punctuation (punctuation can contain vital clues for geocoding, as will be explained in a later section). In addition, they can contain spelling errors because some words with different meanings are pronounced the same (such as "meet" and "meat", or "discussed" and "disgust"), and Sphinx often mistakes other words for locations (for example, Lahore is often substituted for another word). Video OCR poses an entirely different problem. Most often, titles or locations derived from video OCR consist of a single line. This makes geocoding via a language model such as entity extraction very difficult, mainly because usually only two or three words are derived, which in preliminary trials has proven not to be enough to correctly extract entities in most cases. In addition, place names are often spelled incorrectly because some of the characters are not recognized correctly. However, solutions do exist to get around these problems.

An earlier version of the geocoding algorithm was used to process speech recognized text and video OCR (see Christel and Olligschlaeger, 1999 for details). This paper documents the process of geocoding closed captioned text.

One challenge encountered in geocoding freeform text is that often first or last names, or the names of organizations are also the names of places. One common example is "Vernon Jordan". We therefore need to be able to distinguish between words that are a part of the name of an individual and those that are a part of the name of a place. How this is accomplished is discussed in the next section.

A problem common to both street address matching and free form text geocoding is the use of aliases. For example, an alias for 100 Smith Street could be "Joe's Donut Emporium". This is easily handled by the use of alias files. Similarly, in video transcripts, the country "United Kingdom" may be mentioned in several different ways, including "UK", "Great Britain", "Britain", "British", etc.

Finally, place names can be ambiguous. In traditional street address matching ambiguous addresses can be resolved by including other identifiers, such as the name of the city or the zip code, for example. Such indicators are not always present in free form text. Consider the sentence "President Clinton was in Georgia last week to present Washington's views on how to economically restructure the former Soviet Bloc countries". To most people it would be fairly obvious that the two places mentioned in the sentence refer to the country Georgia (as opposed to the state of Georgia) and Washington, DC (as opposed to Washington State). Even if Georgia and Washington have been correctly identified as places we still need to determine which Georgia and which of the approximately 15 Washingtons contained in ESRI's world gazetteer are the ones we are interested in.

The geocoding process used in Informedia to tackle the problems and challenges mentioned above includes the following steps (a schematic code sketch follows the list):

  1. Transcript extraction by video segment
  2. Identification of known places and geographic term expansion (address coverage)
  3. Entity Extraction
  4. Disambiguation of extracted places
  5. Statistics gathering and video synchronization
  6. Coordinate matching
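
The sketch below ties the six steps together with trivial stand-in implementations so that the control flow runs end to end. Every function name and data structure here is a hypothetical placeholder; the real components behind steps 3 through 6 are described in the sections that follow.

    def extract_transcript(segment):            # step 1: closed captions or ASR
        return segment["transcript"]

    def expand_terms(words, expansions):        # step 2: "GERMAN" -> "GERMANY"
        return [expansions.get(w, w) for w in words]

    def extract_locations(words, gazetteer):    # steps 3-4: naive stand-in for
        return [w for w in words if w in gazetteer]  # HMM tagging/disambiguation

    def geocode_segment(segment, gazetteer, expansions):
        words = expand_terms(extract_transcript(segment), expansions)
        candidates = extract_locations(words, gazetteer)
        counts = {}                              # step 5: mention frequencies
        for place in candidates:
            counts[place] = counts.get(place, 0) + 1
        return {p: (n, gazetteer[p]) for p, n in counts.items()}  # step 6

    gazetteer = {"GERMANY": (51.2, 10.4), "JERUSALEM": (31.8, 35.2)}
    expansions = {"GERMAN": "GERMANY", "GERMANS": "GERMANY"}
    segment = {"transcript": "THE GERMANS MET IN JERUSALEM TODAY".split()}
    print(geocode_segment(segment, gazetteer, expansions))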

TRANSCRIPT EXTRACTION

The extraction of video transcripts proceeds largely as described above. First, the closed captioned text of a video is extracted using a capturing device. Next, the video is segmented into stories, the text synchronized with the video, and the resulting meta data is inserted into the Informedia database. Geoprocessing occurs at the segment level, i.e., one news story is processed at a time, and all geocoded information is specific to that segment.

ADDRESS COVERAGE

The address coverage used for geocoding in Informedia is a subset of ESRI's world gazetteer. This subset currently consists of all countries and administrative areas worldwide, as well as approximately 81,000 cities, towns and villages. Each record in the address coverage includes additional information about a place. The columns used for geoprocessing in Informedia are the country name, the type of place, the administrative area and the continent.

For all countries we have added geographic term expansions to account for the different ways in which a country might be mentioned in a news broadcast. For example, the term "Germany" is expanded to also include "German" and "Germans".
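
The following sketch shows one plausible representation of a gazetteer record and its term expansions. The field names mirror the columns described above, but the exact schema is an assumption on our part rather than ESRI's actual format.

    from dataclasses import dataclass

    @dataclass
    class GazetteerRecord:
        name: str        # e.g. "Germany"
        place_type: str  # "country", "administrative area", "city", ...
        admin_area: str  # administrative region; empty for countries
        country: str
        continent: str
        lat: float
        lon: float

    germany = GazetteerRecord("Germany", "country", "", "Germany",
                              "Europe", 51.2, 10.4)

    # Expansions map every surface form back to the canonical record.
    expansions = {"GERMANY": germany, "GERMAN": germany, "GERMANS": germany}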

ENTITY EXTRACTION

Hidden Markov Models (HMMs) have been successfully applied to speech recognition for a number of years (see Witbrock and Hauptmann, 1997, and Lee, 1988 for an overview). More recently, HMMs have also been applied to the extraction of people, places and organizations from text (Kubala et al, 1998; Bikel et al, 1996), as well as dates, times, monetary amounts and percentages. This process is also known as named entity extraction.

A number of commercially available named entity extractors have been built, including systems by BBN and MITRE (see Burger et al, 1998). The named entity extractor used for geoprocessing in Informedia was developed at Carnegie Mellon University and is in part based on one of the first versions of BBN's NYMBLE system, as described in Bikel et al (1996). The current version extracts names, locations and organizations. One of the major advantages of NYMBLE is that training produces a different language model for each type of entity, i.e., co-occurrence probabilities between word pairs differ across entity types, and the probability of a word being the first word of an entity (as in "New findings were discovered today…" and "New York is one of the largest cities…") also differs.

Enhancements to the NYMBLE system include the introduction of prior probabilities based on known places, names and organizations, trigram based Viterbi searches, and a different back-off model for unseen words and word pairs. These enhancements resulted in improved generalization, i.e., the system was better able to handle unseen data, which in turn improved the overall accuracy of the system for geocoding purposes.

The hidden Markov model for our entity extractor was trained using BBN training data consisting of approximately 100 hours of news broadcasts. Entities mentioned in the transcripts contained in the training data set are tagged using the Universal Transcript Format (UTF). It is important to note that we used only punctuated text, i.e., transcripts containing commas, periods and semicolons. Punctuation provides significant clues for entity extraction, as does word capitalization. However, word capitalization was not considered during training because closed captioning is usually in upper case only. Including prior probabilities and the training data, our entity extraction model consists of about 39,000 unique words and 240,000 word pairs.

Entities are extracted from text by parsing one sentence at a time and using a Viterbi search to tag each word as belonging to one of four entity classes (person, place, organization and other), such that, given the emission and transition probabilities derived during training, the overall probability of the state (word/entity) sequence is maximized.
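
To make the mechanics concrete, here is a minimal bigram Viterbi tagger over the four entity states, with toy, hand-set probabilities standing in for those the HMM learns during training. The actual Informedia extractor uses trigram searches and a more sophisticated back-off model, so this illustrates only the search, not the production system.

    import math

    STATES = ["PERSON", "PLACE", "ORG", "OTHER"]

    # Toy emission probabilities P(word | state); unseen words back off
    # to a per-state floor probability (higher for OTHER).
    EMIT = {
        "PERSON": {"WASHINGTON": 0.2, "CLINTON": 0.4},
        "PLACE":  {"WASHINGTON": 0.4, "GEORGIA": 0.4},
        "ORG":    {"PENTAGON": 0.5},
    }
    FLOOR = {"PERSON": 1e-4, "PLACE": 1e-4, "ORG": 1e-4, "OTHER": 1e-2}

    # Toy transition probabilities P(next state | state); uniform here.
    TRANS = {s: {t: 0.25 for t in STATES} for s in STATES}

    def viterbi(words):
        """Return the most probable entity tag for each word."""
        scores = [{s: math.log(EMIT.get(s, {}).get(words[0], FLOOR[s]))
                   for s in STATES}]
        back = []
        for word in words[1:]:
            prev, col, ptr = scores[-1], {}, {}
            for s in STATES:
                emit = math.log(EMIT.get(s, {}).get(word, FLOOR[s]))
                best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
                col[s] = prev[best] + math.log(TRANS[best][s]) + emit
                ptr[s] = best
            scores.append(col)
            back.append(ptr)
        tag = max(STATES, key=lambda s: scores[-1][s])   # best final state
        path = [tag]
        for ptr in reversed(back):                       # trace back
            tag = ptr[tag]
            path.append(tag)
        return list(reversed(path))

    print(viterbi("CLINTON VISITED GEORGIA".split()))
    # -> ['PERSON', 'OTHER', 'PLACE']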

At the time of writing, our entity extractor is approximately 80%-85% accurate in extracting locations from text not used during training. We anticipate that the accuracy will improve as the hidden Markov model is trained on more data. However, the current degree of accuracy is sufficient for geocoding purposes. Since geoprocessing occurs at the segment, i.e., story level, we only need to correctly identify a location once within a news segment in order to capture it. Often places are mentioned several times within a story, increasing the chances that a place will be correctly tagged at least once. A more serious source of error is words that are incorrectly tagged as places, i.e., false positives. This most often occurs when a word at the beginning of a sentence is also a last name, as in "Washington crossed the Delaware River".

DISAMBIGUATION

Once words or sequences of words (such as "South Carolina") are tagged as locations, they are extracted from the text and become candidates for address matching. For each candidate, an attempt is made to find at least one match in the address coverage. If no match is found, the candidate is discarded (in most instances this is due to a false positive during entity extraction; in a few cases it is due to a place not being in the address coverage, even though it exists). If one match is found, we assume that it is the correct place. If more than one match is found, the extracted place is ambiguous.

In order to resolve which of the places matching the name of an ambiguous location is the correct one, we look for clues in the transcript of the segment and assign scores to each place based on how many clues were found. First, we scan the transcript for any mention of the administrative region. For example, if there are two instances of Salem in the address coverage, one in Ohio and the other in Massachusetts, we scan the transcript for any mention of Ohio or Massachusetts. Each time one of the two states is mentioned, the corresponding Salem is given a point. If, after scanning the text, one of the two places has more points than the other, we assume that the one with more points is the correct location.

If an extracted location is still ambiguous, we scan the transcript for mentions of the country in which each candidate place is located, assigning points in the same manner as above. This usually resolves the ambiguity. In the few instances where a location remains ambiguous, we use a default location. For example, if the ambiguous place is Washington and no other clues as to which Washington it might be are found in the transcript, Washington, DC is used as the default match.

Defaults were selected based on a number of factors, including how often they are mentioned in the news, population size, proximity to large cities and the country in which they are located.
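
A sketch of this scoring procedure appears below; the function and variable names are hypothetical, and for brevity the comparison is against single transcript tokens (multi-word region names such as "South Carolina" would need phrase matching in practice).

    from collections import namedtuple

    Place = namedtuple("Place", "name admin_area country lat lon")

    def disambiguate(candidates, transcript_words, default=None):
        """Pick among gazetteer records sharing the same place name."""
        def mentions(value):
            target = value.upper()
            return sum(1 for w in transcript_words if w == target)

        for field in ("admin_area", "country"):   # regions first, then countries
            scored = [(mentions(getattr(c, field)), c) for c in candidates]
            best = max(s for s, _ in scored)
            leaders = [c for s, c in scored if s == best]
            if len(leaders) == 1:
                return leaders[0]
            candidates = leaders                  # still tied: try the next clue
        return default if default is not None else candidates[0]

    salems = [Place("Salem", "Ohio", "United States", 40.9, -80.9),
              Place("Salem", "Massachusetts", "United States", 42.5, -70.9)]
    text = "THE TRIAL IN SALEM MASSACHUSETTS REOPENED TODAY".split()
    print(disambiguate(salems, text).admin_area)   # -> Massachusetts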

STATISTICS AND SYNCHRONIZATION

One of the criteria used in the Informedia user interface for segment retrieval is the relevance of the segment to the query (Christel and Olligschlaeger, 1999). For map based queries this is simply the number of times a place was mentioned in the story. Therefore we count the number of times each candidate is mentioned in the transcript.

In addition, we compute the time in milliseconds from the beginning of the video to the beginning of the first sentence in which a place is mentioned, as well as to the end of the last sentence in which it is mentioned. If a place is mentioned only once, we compute the starting and ending times of that sentence. This allows us to animate maps during playback, highlighting places as they are mentioned in a news story.

Finally, we collapse geographic term expansions for countries so that each country is represented only once, under its proper name. This is necessary for relating the polygon coverage of countries to the list of matched places, and occurs even if the proper name of the country was not mentioned in the transcript. For example, if "German" was mentioned two times, "Germans" once, and "Germany" zero times, then the final matched record consists of the term "Germany" with a frequency of three.
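
A sketch of this bookkeeping is shown below, under the simplifying assumptions that words and sentence boundaries arrive as millisecond spans and that every word falls inside some sentence span; all names are hypothetical. It reproduces the Germany example above.

    def place_statistics(aligned_words, sentences, expansions):
        """aligned_words: (word, start_ms, end_ms) tuples in time order.
        sentences: (start_ms, end_ms) spans, one per sentence.
        expansions: surface form -> canonical place name."""
        stats = {}  # canonical name -> [frequency, first_start_ms, last_end_ms]
        for word, start, _end in aligned_words:
            if word not in expansions:
                continue
            place = expansions[word]
            # Find the span of the sentence containing this word.
            s_start, s_end = next((a, b) for a, b in sentences if a <= start < b)
            if place not in stats:
                stats[place] = [1, s_start, s_end]
            else:
                stats[place][0] += 1
                stats[place][2] = s_end   # extend to the last mentioning sentence
        return stats

    words = [("GERMAN", 500, 900), ("GERMANS", 4200, 4700),
             ("GERMAN", 9100, 9600)]
    sentences = [(0, 3000), (3000, 8000), (8000, 12000)]
    expansions = {"GERMAN": "Germany", "GERMANS": "Germany"}
    print(place_statistics(words, sentences, expansions))
    # -> {'Germany': [3, 0, 12000]}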

COORDINATE MATCHING

After the entire transcript of a news segment has been processed in the manner described above, all successfully matched locations are added to the Informedia meta data. The record for each location also contains the library, movie and segment identifiers, as well as the proper country name, administrative region and continent. The latter fields permit the creation of choropleth maps to visualize, for example, which countries are mentioned most often in a particular topic and to produce change maps over time (see Christel, 1999 for further details).

GEOCODING ACCURACY

In order to determine the accuracy of the geoprocessing algorithm we randomly selected 200 CNN news segments representing about 5 hours of video from the Informedia library and geocoded them. None of the transcripts contained in the 200 segments were used during training of the entity extractor.

A total of 357 places were mentioned in the 200 segments. Of these, the geocoding algorithm correctly identified and matched 269, or 75%. This is approximately on par with street address matching. For the 88 locations (25%) that were incorrect, the following error sources were identified:

These initial results are quite encouraging, especially considering that many of the errors could be eliminated by simply adding places to the address coverage. In addition, we anticipate that the error rate will be reduced further by expanding the number of transcripts used to train the entity extractor.

FUTURE WORK

Although the results described in the previous section are promising, there are many areas in which the geocoding algorithm could be improved. Apart from the entity extractor, one other major source of error is the disambiguation procedure. We would like to use a more formal approach to disambiguation than the current one. Specifically, a probability based model that examines co-occurrences of places, people and organizations would likely yield better results. For example, if the words "White House", "Clinton" and "Pentagon" are mentioned in the same sentence as the term "Washington", then the likelihood that Washington, DC is the correct place would be greater than that of, say, Washington State.
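
As an illustration of the direction we have in mind, the sketch below scores candidate senses by summed log co-occurrence probabilities of cue terms found in the same sentence. The probabilities are toy numbers chosen for this example; in practice they would be estimated from tagged training data.

    import math

    # Toy P(cue term appears in the sentence | candidate place).
    COOCCUR = {
        "Washington, DC":   {"WHITE HOUSE": 0.30, "CLINTON": 0.25,
                             "PENTAGON": 0.20},
        "Washington State": {"SEATTLE": 0.30, "MICROSOFT": 0.15},
    }
    FLOOR = 0.01  # back-off for cues never seen with a candidate

    def cooccurrence_score(candidate, cues):
        probs = COOCCUR[candidate]
        return sum(math.log(probs.get(cue, FLOOR)) for cue in cues)

    cues = ["WHITE HOUSE", "CLINTON", "PENTAGON"]
    best = max(COOCCUR, key=lambda c: cooccurrence_score(c, cues))
    print(best)   # -> Washington, DC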

A further interesting addition would be the ability to geocode images rather than text. Other members of the Informedia team are currently working on image recognition algorithms that would allow us to identify places such as the White House, the Golden Gate Bridge, the Taj Mahal and the Leaning Tower of Pisa.

ACKNOWLEDGMENTS

This material is based on work supported by the National Science Foundation, DARPA and NASA under NSF Cooperative Agreement No. IRI-9411299. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of DARPA or the other sponsoring agencies. The support of Informedia partners and team members has been invaluable; further information on Informedia-related efforts and a complete list of contributors can be found at www.informedia.cs.cmu.edu. Special thanks to Mike Christel for his valuable suggestions, as well as Chang Huang and Anna Vtolchkina for their work on the map interface and address coverage.

REFERENCES

Ahlberg, C. and Shneiderman, B. (1994): "Visual Information Seeking: Tight Coupling of Dynamic Query Filters with Starfield Displays." In Proc. ACM CHI Conference on Human Factors in Computing Systems. Boston, 313-322.

Bikel, D.M., Miller, S., Schwartz, R., and Weischedel, R. (1996): "Nymble: A High Performance Learning Name-Finder." Proc. 5th Conference on Applied Natural Language Processing, Association for Computational Linguistics, 194-201

Burger, J.D., Palmer, D., and Hirschman, L. (1998): "Named Entity Scoring for Speech Input." To be published

Christel, M.G. (1999): "Visual Digests for Video Libraries." Submitted to Proc. ACM Multimedia 1999, Orlando

Christel, M.G. and Olligschlaeger, A.M. (1999): "Interactive Maps for a Digital Video Library." To be published in Proc. IEEE International Conference on Multimedia Computing and Systems, June 1999

Kubala, F., Schwartz, R., Stone, R., and Weischedel, R. (1998): "Named Entity Extraction From Speech." BBN Technologies, Cambridge, MA

Lee, K. (1988): "Large-Vocabulary Speaker-Independent Continuous Speech Recognition: The Sphinx System." Ph.D. Thesis, Carnegie Mellon University

Olligschlaeger, A.M. (1997): "A Spatial Analysis of Crime Using GIS-Based Data: Weighted Spatial Adaptive Filtering and Chaotic Cellular Forecasting with Applications to Street Level Drug Markets." Ph.D. Thesis, Carnegie Mellon University

Olsen, K.A., Korfhage, R.R., Sochats, K.M., Spring, M.B., and Williams, J.G. (1993): "Visualization of a Document Collection: The VIBE System." Information Processing and Management, 29(1), 69-81

Sato, T., Kanade, T., Hughes, E., and Smith, M. (1998): "Video OCR for Digital News Archive." In Proc. Workshop on Content-Based Access of Image and Video Databases, Los Alamitos, CA, 52-60.

Wactlar, H., Christel, M.G., Gong, Y. and Hauptmann, A.G. (1999): "Lessons Learned From Building a Terabyte Digital Video Library." IEEE Computer, 32(2), 66-73.

Witbrock, M. and Hauptmann, A.G. (1997): "Using Words and Phonetic Strings for Efficient Information Retrieval From Imperfectly Transcribed Spoken Documents." Proc. ACM Digital Libraries, 30-35