
Generalized Example-Based Machine Translation

2. Our Implementation of Example-Based Machine Translation

Other EBMT systems operate on parse trees, or find the most similar complete sentence and modify its translation based on the differences between the sentence to be translated and the matched example. Because our system was designed from the outset to be one engine in a multi-engine translation system, we were able to use a different approach. Since the EBMT engine is not required to produce output for every word of the input -- one of the other engines can cover for it -- it can produce partial translations.

The basic EBMT system that we are generalizing performs partial exact matches against the examples it has been given, and relies on the multi-engine architecture to assemble the partial translations. Its training consists of indexing the source-language half of each translation example in an inverted index -- a listing, for every possible word, of all the locations at which that word occurs. When asked to perform a translation, it first finds every exactly-matching phrase in its database, regardless of any overlap with other matches (though it always selects the longest match within any particular translation example). To find the matching phrases using the inverted index, one retrieves the occurrence lists for the first two words and determines which occurrences of the first word are adjacent to an occurrence of the second. One then retrieves the occurrence list for the third word, and either extends an existing match or creates a new one wherever the third word appears adjacent to the second in the example base. This process repeats until the entire input text has been processed.
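The matching procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not the engine's actual implementation: the names (build_index, find_matches) are invented for this sketch, and it records only the longest phrase shared by all surviving occurrences at each starting position, where the real engine tracks the longest match for every individual example.

```python
from collections import defaultdict

def build_index(examples):
    """Inverted index over the source half of each example:
    word -> list of (example_id, position) occurrences."""
    index = defaultdict(list)
    for ex_id, sentence in enumerate(examples):
        for pos, word in enumerate(sentence.split()):
            index[word].append((ex_id, pos))
    return index

def find_matches(index, input_words):
    """For each input position, extend an exact phrase match word by
    word through the occurrence lists.  Returns tuples of
    (input_start, length, example_id, example_position) for matches
    of two or more words."""
    results = []
    for start in range(len(input_words) - 1):
        # Occurrence list for the first word of the candidate phrase.
        active = list(index.get(input_words[start], []))
        length = 1
        while active and start + length < len(input_words):
            next_occ = set(index.get(input_words[start + length], []))
            # Keep only occurrences where the next input word is
            # adjacent in the same example.
            extended = [(ex, p) for ex, p in active
                        if (ex, p + length) in next_occ]
            if not extended:
                break
            active = extended
            length += 1
        if length >= 2:
            for ex, p in active:
                results.append((start, length, ex, p))
    return results
```

For instance, translating "the cat sat on the rug" against examples "the cat sat on the mat" and "the dog sat on the rug" yields overlapping matches such as "the cat sat on the" from the first example and "sat on the rug" from the second, exactly the behavior described above.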

[Diagram of the EBMT Paradigm]
Figure 1: Our Approach to EBMT


Once all matching phrases have been found, the next step is to determine what portion of the target-language half of each matched example corresponds to the matched part of that example. If the entire example sentence has been matched, this correspondence is trivial to determine: it is the entire target-language half of the example. If only a portion of the example has been matched, the correspondence becomes much trickier -- finding the translation is then equivalent to performing a word-level alignment between the two halves of the example, a problem which remains an active area of research.

In our EBMT engine, we use a simple heuristic word-level alignment algorithm which, while far from perfect, is nevertheless adequate for the task. The alignment algorithm starts by constructing a table of possible correspondences between individual words, based on a bilingual dictionary. It then examines the correspondence table to determine the shortest and longest possible translations of the matched chunk. For every contiguous substring of the longest possible translation that includes the shortest one, a set of simple scoring functions is applied, and the substring with the best score is declared the proper translation of the chunk. More details on the alignment algorithm will be given below.

For run-time efficiency, the EBMT engine does not actually process every single match -- some short common phrases such as "of the" can occur thousands of times in the example base. Instead, it examines at most eight matches for any particular phrase in the input, and it stops as soon as it finds one example for which the scoring metric indicates a perfect alignment between the source-language and target-language chunks (both limits are set in the program's configuration file). The candidate matches are processed starting with the most recently added one and proceeding backwards to the earliest addition to the example base. This ordering gives later additions priority over older examples, and supports the correction of translation errors by incrementally adding corrections to the indexed example base (including on-the-fly updates at run-time).
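The candidate-selection policy amounts to a short loop. In this sketch the limit of eight and the perfect-alignment score of 1.0 are illustrative stand-ins for the values read from the configuration file, and align is assumed to be a callback returning a (translation, score) pair for a candidate match.

```python
from itertools import islice

def best_translations(candidates, align, max_candidates=8):
    """candidates: matches in the order they were added to the example
    base.  align(cand) -> (translation, score), where a score of 1.0
    indicates a perfect source/target alignment.  Newest candidates
    are tried first so that later corrections override older examples."""
    results = []
    for cand in islice(reversed(candidates), max_candidates):
        translation, score = align(cand)
        results.append((translation, score))
        if score == 1.0:  # perfect alignment: no need to look further
            break
    return results
```

Because iteration runs newest-first, appending a corrected example at run time immediately shadows the faulty older one without re-indexing anything else.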

How efficient is the EBMT engine? Its run-time speed obviously depends on the speed of the computer on which it is run, and (less obviously) on the size of the example base -- run time increases approximately in proportion to the amount of text in the example base since the index grows in proportion and some additional phrases can be matched as the example base grows. Our typical configuration has an example base of 10 to 20 megabytes of text, which is sufficient for reasonably good coverage of unrestricted input texts. With example bases of such sizes, the engine can translate upwards of 500 words per minute on even the slowest Pentium processor, and 5000 or more words per minute on a 300 MHz UltraSPARC or Pentium II. This is easily fast enough for real-time translation of a conversation between two people who do not speak the same language.


(Last updated 04-Aug-99)