Trigger Toolkit
Adam Berger
April, 1997
Overview
The software described herein is for finding pairs of words which co-occur with
high frequency in text. Starting from a corpus of text---several million words
is what we have in mind---the program automatically discovers ordered pairs of
words where the occurrence of the first word in a pair makes the subsequent
appearance of the second word much more likely than it otherwise would be. For
example, the toolkit might discover that the word "patient" augurs the imminent
appearance of "drug." To rank pairs of words, the toolkit uses mutual
information. For more details about mutual information and why triggers might
be useful, see the references.
What You'll Need
- A corpus of text. The toolkit doesn't care how it's stored, but you'll have
to supply a function to read individual words from the corpus. The toolkit was
written with the ability to handle corpora on the order of hundreds of millions
of words.
- A list of words (a "vocabulary"). The format of this file is ascii,
one word spelling per line.
- Computer resources: Foremost is a disk with sufficient space. Depending on
the size of the corpus and vocabulary, this may be 1GB or more. The toolkit is
able to handle various sizes of physical memory, but will run more efficiently
with more memory.
What You'll Get
A file containing triplets: (s-word, t-word, mutual information). For instance,
changes revisions 0.00232
Soviet missiles 0.00113
medical surgical 0.00109
Gulf Gulf 0.00108
I me 0.00102
On the Switchboard corpus of roughly 3 million words, this toolkit requires
about 6 hours on a modern workstation to extract a ranked list of word
pairs. Larger corpora will take longer, but there is a way, described below,
to exploit parallelism to speed up the task.
How To Do It
[We'll call the directory in which you've un-tarred the toolkit $INSTALLDIR]
- Edit ${INSTALLDIR}/Configure.H, which contains user-specific information
such as the minimum number of co-occurrences required for a (s,t) pair to be
considered and the code for reading words from the corpus. If all goes well,
this should be the only file you need to edit.
- Run ${INSTALLDIR}/make to generate the executables
- Run ${INSTALLDIR}/writeScripts to generate the scripts. Run without
arguments to get the proper arguments. The number of scripts generated will
depend on the size of the vocabulary and the number of bytes you are allowing
the tool to use (specified in Configure.H).
- Execute master_script, which runs all the subtasks sequentially. See
"Exploiting parallelism," below.
- There are several built-in sanity checks. For instance, the "a" jobs display
the spelling of the first few words from the corpus. Make sure these look
reasonable, in case there's some off-by-one error in the user-supplied
corpus-interface code in Configure.H.
Details
The trigger toolkit is highly configurable. Here's what you can adjust (all in
Configure.H), along with their default values:
MAX_BYTES (150000000) // the "findPairs" step will try to allocate this
// much memory on each processor.
COUNT_CUTOFF (8) // each (s,t) pair must co-occur at least this many
// times in order to be considered.
MAX_WORDS (100000000) // how many words from the corpus to process. Set
// to a very large value to process the entire thing
MAX_WINDOW (500) // count (s,t) only when they co-occur within this
// many words.
MIN_WINDOW (3) // ...but no more than this close
MI_THRESHOLD (0.00001) // discard (s,t) with mutual information below this
ARTICLE_BOUNDARY_WORD ("") // An (s,t) pair cannot overlap one of these
INDEX_OF_FIRST_WORD (1) // Change this if the 1st 10 words printed out look
// nonsensical
You shouldn't really need to touch any code other than Configure.H.
Appendix: What's included
- Makefile : generate executables for your machine
- writeScripts.C : generate all scripts necessary for a full execution
- findPairs.C : implements phase A of computation: find (s,t) pairs which
co-occur sufficiently often. For efficiency, the full task is broken up
into many sub-parts, the results of which are subsequently merged.
- rankPairs.C : implements phase B of computation: rank the (s,t) pairs
according to mutual information.
- Configure.H : user-configurable information.
- Trigger.H
- WordList.H you shouldn't need to worry about these.
- Architecture.H
Exploiting Parallelism
The "How to do it" section above describes how to run the trigger toolkit in
"pushbutton" mode, with no intervention from the user. This section is for
those who would like to speed up the task by exploiting parallelism.
The stages of the task are
- (a) finding a list of (s,t) pairs: This task is broken into a series of
subtasks (scripts a.0, a.1, a.2 ...), each of which
is responsible for finding the word pairs (s,t) with t in some portion of the
full range of words. These scripts are fully independent of one another and
thus can be run in parallel on many machines.
- (b) merging the results of the "a" steps into one file (via a Unix "cat")
- (c) ranking the found (s,t) pairs: this task is broken into a series of
rounds; each round handles just as many (s,t) pairs as fit in memory. Running
master_script will cause the rounds to be executed sequentially. You may
override this by adding a final (optional) parameter to the command line:
rankPairs (corpus) (pair file) (vocabulary) (output file) [round]
if 'round' is larger than the number of required rounds, the script will do
nothing.
Further Reading
This reference
contains a discussion of why mutual information is an appropriate way to gauge
the correlation between words. It also contains one approach for incorporating
trigger pairs into a language model.
Shortcomings
As with all software, this toolkit was written with a certain type of input in
mind. Specifically, we envision a corpus of between 1 and 500 million words,
and a vocabulary of roughly 50,000 words. The toolkit requires a few days to
run (exploiting parallelism) on inputs on the far end of this scale. For small
input sizes, the toolkit may not perform as well as some another tool designed
specifically for the task.
Caveat
This software is made available for research purposes only. It may be
redistributed freely for this purpose, in full or in part, provided
that the entire copyright notice is included on any copies of this
software and applications and derivations thereof.
This software is provided on an "as is" basis, without warranty of any
kind, either expressed or implied, as to any matter including, but not
limited to warranty of fitness of purpose, or merchantability, or
results obtained from use of this software.
Send questions, comments, suggestions to aberger@cs.cmu.edu.