Trigger Toolkit

Adam Berger

April, 1997

Overview

The software described herein is for finding pairs of words which co-occur with high frequency in text. Starting from a corpus of text---several million words is what we have in mind---the program automatically discovers ordered pairs of words where the occurrence of the first word in a pair makes the subsequent appearance of the second word much more likely than it otherwise would be. For example, the toolkit might discover that the word "patient" augurs the imminent appearance of "drug." To rank pairs of words, the toolkit uses mutual information. For more details about mutual information and why triggers might be useful, see the references.

What You'll Need

  1. A corpus of text. The toolkit doesn't care how it's stored, but you'll have to supply a function to read individual words from the corpus. The toolkit was written with the ability to handle corpora on the order of hundreds of millions of words.
  2. A list of words (a "vocabulary"). The format of this file is ascii, one word spelling per line.
  3. Computer resources: Foremost is a disk with sufficient space. Depending on the size of the corpus and vocabulary, this may be 1GB or more. The toolkit is able to handle various sizes of physical memory, but will run more efficiently with more memory.

What You'll Get

A file containing triplets: (s-word, t-word, mutual information). For instance, changes revisions 0.00232 Soviet missiles 0.00113 medical surgical 0.00109 Gulf Gulf 0.00108 I me 0.00102 On the Switchboard corpus of roughly 3 million words, this toolkit requires about 6 hours on a modern workstation to extract a ranked list of word pairs. Larger corpora will take longer, but there is a way, described below, to exploit parallelism to speed up the task.

How To Do It

[We'll call the directory in which you've un-tarred the toolkit $INSTALLDIR]
  1. Edit ${INSTALLDIR}/Configure.H, which contains user-specific information such as the minimum number of co-occurrences required for a (s,t) pair to be considered and the code for reading words from the corpus. If all goes well, this should be the only file you need to edit.
  2. Run ${INSTALLDIR}/make to generate the executables
  3. Run ${INSTALLDIR}/writeScripts to generate the scripts. Run without arguments to get the proper arguments. The number of scripts generated will depend on the size of the vocabulary and the number of bytes you are allowing the tool to use (specified in Configure.H).
  4. Execute master_script, which runs all the subtasks sequentially. See "Exploiting parallelism," below.
  5. There are several built-in sanity checks. For instance, the "a" jobs display the spelling of the first few words from the corpus. Make sure these look reasonable, in case there's some off-by-one error in the user-supplied corpus-interface code in Configure.H.

Details

The trigger toolkit is highly configurable. Here's what you can adjust (all in Configure.H), along with their default values: MAX_BYTES (150000000) // the "findPairs" step will try to allocate this // much memory on each processor. COUNT_CUTOFF (8) // each (s,t) pair must co-occur at least this many // times in order to be considered. MAX_WORDS (100000000) // how many words from the corpus to process. Set // to a very large value to process the entire thing MAX_WINDOW (500) // count (s,t) only when they co-occur within this // many words. MIN_WINDOW (3) // ...but no more than this close MI_THRESHOLD (0.00001) // discard (s,t) with mutual information below this ARTICLE_BOUNDARY_WORD ("") // An (s,t) pair cannot overlap one of these INDEX_OF_FIRST_WORD (1) // Change this if the 1st 10 words printed out look // nonsensical You shouldn't really need to touch any code other than Configure.H.

Appendix: What's included

Exploiting Parallelism

The "How to do it" section above describes how to run the trigger toolkit in "pushbutton" mode, with no intervention from the user. This section is for those who would like to speed up the task by exploiting parallelism.

The stages of the task are

Further Reading

This
reference contains a discussion of why mutual information is an appropriate way to gauge the correlation between words. It also contains one approach for incorporating trigger pairs into a language model.

Shortcomings

As with all software, this toolkit was written with a certain type of input in mind. Specifically, we envision a corpus of between 1 and 500 million words, and a vocabulary of roughly 50,000 words. The toolkit requires a few days to run (exploiting parallelism) on inputs on the far end of this scale. For small input sizes, the toolkit may not perform as well as some another tool designed specifically for the task.

Caveat

This software is made available for research purposes only. It may be redistributed freely for this purpose, in full or in part, provided that the entire copyright notice is included on any copies of this software and applications and derivations thereof.

This software is provided on an "as is" basis, without warranty of any kind, either expressed or implied, as to any matter including, but not limited to warranty of fitness of purpose, or merchantability, or results obtained from use of this software.

Send questions, comments, suggestions to aberger@cs.cmu.edu.