Publications, by Topic

Jump to: Information Retrieval, Language Identification, Topic Detection and Tracking, Digital Forensics, Personal Computers, Math and Crypto, Miscellaneous.

Explore my papers in Google Scholar - Ralf Brown.

Machine Translation

---- Anaphora Resolution ----

Jaime G. Carbonell and Ralf D. Brown, "Anaphora Resolution: A Multi-Strategy Approach". In Proceedings of the Twelfth International Conference on Computational Linguistics (COLING'88), pp. 96-101. Budapest [photos], August 1988.
Available in Scribe and PDF format.
Abstract: Anaphora resolution has proven to be a very difficult problem; it requires the integrated application of syntactic, semantic, and pragmatic knowledge. This paper examines the hypothesis that instead of attempting to construct a monolithic method for resolving anaphora, the combination of multiple strategies, each exploiting a different knowledge source, proves more effective - theoretically and computationally. Cognitive plausibility is established in that human judgements of the optimal anaphoric referent accord with those of the strategy-based method, and human inability to determine a unique referent corresponds to the cases where different strategies offer conflicting candidates for the anaphoric referent.

---- Disambiguation ----

Ralf D. Brown, "Augmentation", Machine Translation, 1989, vol 4 #2, pp. 129-147.

Ralf D. Brown, "Augmentation" in K. Goodman, ed. KBMT-89 Project Report. Center for Machine Translation, Carnegie Mellon University. 1989.

Ralf D. Brown and Sergei Nirenburg, "Human-Computer Interaction for Semantic Disambiguation". In Proceedings of the Thirteenth International Conference on Computational Linguistics (COLING'90), vol 3, pp. 42-47. Helsinki, Finland [photos].
Available in Scribe and PostScript format.
Abstract: We describe a semi-automatic semantic disambiguator integrated in a knowledge-based machine translation system. It is used to bridge the analysis and generation stages in machine translation. The user interface of the disambiguator is built on mouse-based multiple-selection menus.

Ralf D. Brown. "Automatic and Interactive Augmentation". In K. Goodman and S. Nirenburg (eds.), The KBMT Project: A Case Study in Knowledge-Based Machine Translation. Morgan Kaufmann Publishers, 1991. ISBN 1-55860-129-5.

---- Example-Based Machine Translation ----

Ralf D. Brown, "Example-Based Machine Translation in the Pangloss System". In Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), pp. 169-174. Copenhagen, Denmark, August 5-9, 1996. (CiteSeer doi:10.1.1.14.296)
Available in LaTeX and PostScript format.
Abstract: The Pangloss Example-Based Machine Translation engine (PanEBMT) is a translation system requiring essentially no knowledge of the structure of a language, merely a large parallel corpus of example sentences and a bilingual dictionary. Input texts are segmented into sequences of words occurring in the corpus, for which translations are determined by subsentential alignment of the sentence pairs containing those sequences. These partial translations are then combined with the results of other translation engines to form the final translation produced by the Pangloss system. In an internal evaluation, PanEBMT achieved 70.2% coverage of unrestricted Spanish news-wire text, despite a simplistic subsentential alignment algorithm, a suboptimal dictionary, and a corpus from a different domain than the evaluation texts.

Ralf D. Brown, "Automated Dictionary Extraction for 'Knowledge-Free' Example-Based Translation". In Proceedings of the Seventh International Conference on Theoretical and Methodological Issues in Machine Translation, pp. 111-118. Santa Fe, July 23-25, 1997. (CiteSeer doi:10.1.1.48.1774)
Available in LaTeX and PostScript format.
Abstract: An Example-Based Machine Translation system is supplied with a sentence-aligned bilingual corpus, but no other knowledge sources. Using the knowledge implicit in the corpus, it generates a bilingual word-for-word dictionary for alignment during translation. With such an automatically-generated dictionary, the system covers (with equivalent quality) more of its input on unseen texts than the same system does when provided with a manually-created general-purpose dictionary and other knowledge sources.

My COLING-ACL'98 paper is also relevant to EBMT.

Ralf D. Brown. "Adding Linguistic Knowledge to a Lexical Example-Based Translation System". In Proceedings of the Eighth International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-99), pp. 22-32. Chester, UK [photos], August 1999. (CiteSeer doi:10.1.1.44.1381)
Available in LaTeX and PostScript format.
Abstract: Example-Based Machine Translation (EBMT) using partial exact matching against a database of translation examples has proven quite successful, but requires a large amount of pre-translated text in order to achieve broad coverage of unrestricted text. By adding linguistically tagged entries to the example base and permitting recursive matches that replace the matched text with the associated tag, substantial reductions in the required amount of pre-translated text can be achieved. A modest investment of time -- on the order of two person-weeks -- adding linguistic knowledge reduces the required example text by a factor of six or more, while retaining comparable translation quality. This reduction makes EBMT more attractive for so-called "low-density" languages for which little data is available.

Ralf Brown. "Example-Based Machine Translation at Carnegie Mellon University". In The ELRA Newsletter, European Language Resources Association, vol 5:1, January-March 2000.
Available in PDF format.

Ralf D. Brown. "Automated Generalization of Translation Examples". In Proceedings of the Eighteenth International Conference on Computational Linguistics (COLING-2000), pp. 125-131. Saarbrücken, Germany [photos], August 2000. (CiteSeer doi:10.1.1.14.3211)
Available in PostScript and LaTeX format.
Abstract: Previous work has shown that adding generalization of the examples in the corpus of an example-based machine translation (EBMT) system can reduce the required amount of pretranslated example text by as much as an order of magnitude for Spanish-English and French-English EBMT. Using word clustering to automatically generalize the example corpus can provide the majority of this improvement for French-English with no manual intervention; the prior work required a large tagged bilingual dictionary and the manual creation of grammar rules. By seeding the clustering with a small amount of manually-created information, even better performance can be achieved. This paper describes a method whereby bilingual word clustering can be performed using standard monolingual document clustering techniques, and its effectiveness at reducing the size of the example corpus required.

Ying Zhang, Ralf D. Brown, and Robert E. Frederking. "Adapting an Example-Based Translation System to Chinese". In Proceedings of HLT 2001: First International Conference on Human Language Technology Research, pp. 7-10. San Diego, California, March 18-21, 2001.
Available in PostScript.
Abstract: We describe an Example-Based Machine Translation (EBMT) system and the adaptations and enhancements made to create a Chinese-English translation system from the Hong Kong legal code and various other bilingual resources available from the Linguistic Data Consortium (LDC).

Ralf D. Brown. "Transfer-Rule Induction for Example-Based Translation". In Proceedings of the MT Summit VIII Workshop on Example-Based Machine Translation, pp. 1-11. Santiago de Compostela, Spain, 18 September 2001. (CiteSeer doi:10.1.1.21.6724)
Available in PostScript.
Abstract: Previous work has shown that grammars and similar structure can be induced from unlabeled text (both monolingually and bilingually), and that the performance of an example-based machine translation (EBMT) system can be substantially enhanced by using clustering techniques to determine equivalence classes of individual words which can be used interchangeably, thus converting translation examples into templates. This paper describes the combination of these two approaches to further increase the coverage (or conversely, decrease the required training text) of an EBMT system. Preliminary results show that a reduction in required training text by a factor of twelve is possible for translation from French into English.

Ying Zhang, Ralf D. Brown, Robert E. Frederking, and Alon Lavie. "Pre-processing of Bilingual Corpora for Mandarin-English EBMT". In Proceedings of the MT Summit VIII. Santiago de Compostela, Spain, September 2001. (CiteSeer doi:10.1.1.67.4098)
Available in PostScript and PDF.
Abstract: Pre-processing of bilingual corpora plays an important role in Example-Based Machine Translation (EBMT) and Statistical-Based Machine Translation (SBMT). For our Mandarin-English EBMT system, pre-processing includes segmentation for Mandarin, bracketing for English and building a statistical dictionary from the corpus. In this paper, we describe the work we have done to improve the segmentation for Mandarin and the bracketing process for English to increase the length of English phrases. The final results of the corpus pre-processing are a segmented/bracketed aligned bilingual corpus and a statistical dictionary. We achieved positive results by increasing the average length of Chinese terms about 60% and 10% for English. The statistical dictionary gained about a 30% increase in coverage.

Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell, Peter Jansen, Ralf Brown. "Maximal Lattice Overlap in Example-Based Machine Translation", Technical Report CMU-CS-03-138/CMU-LTI-03-174, June 2003. (CiteSeer doi:10.1.1.73.2384)
Abstract: Example-Based Machine Translation (EBMT) retrieves pre-translated phrases from a sentence-aligned bilingual training corpus to translate new input sentences. EBMT uses long pre-translated phrases effectively but is subject to disfluencies at phrasal translation boundaries. We address this problem by introducing a novel method that exploits overlapping phrasal translations and the increased confidence in translation accuracy they imply. We specify an efficient algorithm for producing translations using overlap. Finally, our empirical analysis indicates that this approach produces higher quality translations than the standard method of EBMT in a peak-to-peak comparison.

Ralf D. Brown, Rebecca Hutchinson, Paul N. Bennett, Jaime G. Carbonell, Peter Jansen. "Reducing Boundary Friction Using Translation-Fragment Overlap", in Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, September 2003, pp. 24-31. (CiteSeer doi:10.1.1.68.7166)
Available in PostScript.
Abstract: Many corpus-based Machine Translation (MT) systems generate a number of partial translations which are then pieced together rather than immediately producing one overall translation. While this makes them more robust to ill-formed input, they are subject to disfluencies at phrasal translation boundaries even for well-formed input. We address this "boundary friction" problem by introducing a method that exploits overlapping phrasal translations and the increased confidence in translation accuracy they imply. We specify an efficient algorithm for producing translations using overlap. Finally, our empirical analysis indicates that this approach produces higher quality translations than the standard method of combining non-overlapping fragments generated by our Example-Based MT system in a peak-to-peak comparison.

Ralf D. Brown. "Clustered Transfer Rule Induction for Example-Based Translation". In Michael Carl & Andy Way (eds.) Recent Advances in Example-Based Machine Translation (Dordrecht: Kluwer Academic Publishers, 2003), pp. 287-305.

Ralf D. Brown, "A Modified Burrows-Wheeler Transform for Highly-Scalable Example-Based Translation", in Machine Translation: From Real Users to Research, Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-2004), Washington, D.C., USA, September/October 2004, pp. 27-36. Springer, Lecture Notes in Artificial Intelligence, Volume 3265, ISSN 0302-9743.
Available in PostScript and PDF.
Abstract: The Burrows-Wheeler Transform (BWT) was originally developed for data compression, but can also be applied to indexing text. In this paper, an adaptation of the BWT to word-based indexing of the training corpus for an example-based machine translation (EBMT) system is presented. The adapted BWT embeds the necessary information to retrieve matched training instances without requiring any additional space and can be instantiated in a compressed form which reduces disk space and memory requirements by about 40% while still remaining searchable without decompression.
Both the speed advantage from O(log N) lookups compared to the O(N) lookups in the inverted-file index which had previously been used and the structure of the index itself act as enablers for additional capabilities and run-time speed. Because the BWT groups all instances of any n-gram together, it can be used to quickly enumerate the most-frequent n-grams, for which translations can be precomputed and stored, resulting in an order-of-magnitude speedup at run time.
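As a rough illustration of why a sorted index keeps all instances of an n-gram together and allows O(log N) lookup, here is a minimal Python sketch. It uses a plain word-level suffix array, not the modified BWT described in the paper, and it omits compression entirely. The function names (build_suffix_array, find_ngram) and the toy corpus are hypothetical.

```python
def build_suffix_array(tokens):
    """Return the start positions of all word-level suffixes, in sorted order."""
    # Naive construction that slices every suffix; suitable for a toy corpus only.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, suffix_array, ngram, upper):
    """Binary search for the first suffix whose n-word prefix is >= ngram
    (or > ngram when upper is True)."""
    n = len(ngram)
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        prefix = tokens[start:start + n]
        if prefix < ngram or (upper and prefix == ngram):
            lo = mid + 1
        else:
            hi = mid
    return lo


def find_ngram(tokens, suffix_array, ngram):
    """Return the sorted corpus positions at which ngram occurs."""
    ngram = list(ngram)
    first = _bound(tokens, suffix_array, ngram, upper=False)
    last = _bound(tokens, suffix_array, ngram, upper=True)
    # All matches occupy one contiguous slice of the sorted index.
    return sorted(suffix_array[first:last])


corpus = "the cat sat on the mat and the cat ran".split()
index = build_suffix_array(corpus)
print(find_ngram(corpus, index, ["the", "cat"]))  # [0, 7]
```

Because the matches form one contiguous slice, the size of that slice is also the n-gram's frequency, which is what makes enumerating frequent n-grams cheap in an index of this kind.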

Jae Dong Kim, Ralf D. Brown, Peter J. Jansen, and Jaime G. Carbonell. "Symmetric Probabilistic Alignment for Example-Based Translation". In Proceedings of the Tenth Workshop of the European Association for Machine Translation (EAMT-05), Budapest, Hungary, May 2005.

Ralf D. Brown. "Context-Sensitive Retrieval for Example-Based Translation". In Proceedings of the Tenth Machine Translation Summit (MT Summit X), pp. 9-15. Phuket, September 2005.
Available in PostScript and PDF.
Abstract: Example-Based Machine Translation (EBMT) systems have typically operated on individual sentences without taking into account prior context. By adding a simple reweighting of retrieved fragments of training examples on the basis of whether the previous translation retrieved any fragments from examples within a small window of the current instance, translation performance is improved. A further improvement is seen by performing a similar reweighting when another fragment of the current input sentence was retrieved from the same training example. Together, a simple, straightforward implementation of these two factors results in an improvement on the order of 1.0-1.6% in the BLEU metric across multiple data sets in multiple languages.

Christian Monson, Ariadna Font Llitjós, Roberto Aranovich, Lori Levin, Ralf Brown, Erik Peterson, Jaime Carbonell, and Alon Lavie: "Building NLP Systems For Two Resource-Scarce Indigenous Languages: Mapudungun and Quechua". In LREC-2006: Fifth International Conference on Language Resources and Evaluation. 5th SALTMIL Workshop on Minority Languages: "Strategies for Developing Machine Translation for Minority Languages", Genoa, Italy, 23 May 2006; pp. 15-24.

Aaron B. Phillips, Violetta Cavalli-Sforza, and Ralf Brown. "Improving Example Based Machine Translation Through Morphological Generalization and Adaptation." Machine Translation Summit XI, Copenhagen, Denmark, September 2007.
Available in PDF.

Ralf D. Brown, "Exploiting Document-Level Context for Data-Driven Machine Translation". In Proceedings of the Eighth Conference of the Association for Machine Translation in the Americas (AMTA-2008), Waikiki, Hawaii.
Available in PDF.

Rashmi Gangadharaiah, Ralf D. Brown, and Jaime Carbonell. "Active Learning in Example-Based Machine Translation". In Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009), pp. 227-230. ed. Kristiina Jokinen and Eckhard Bick. Odense, Denmark, May 14-16, 2009.

Aaron B. Phillips and Ralf D. Brown, "Cunei Machine Translation Platform: System Description". In Mikel L. Forcada and Andy Way, eds., Proceedings of the Third Workshop on Example-Based Machine Translation, Dublin, Ireland, November 12-13, 2009.

Jae Dong Kim, Ralf D. Brown, and Jaime G. Carbonell. "Chunk-Based EBMT". In Proceedings of the 14th Workshop of the European Association for Machine Translation (EAMT-2010). Saint-Raphaël, France, May 27-28, 2010.

Rashmi Gangadharaiah, Ralf D. Brown and Jaime Carbonell, "Automatic Determination of Number of Clusters for Creating Templates in Example-Based Machine Translation". In Proceedings of the 14th Workshop of the European Association for Machine Translation (EAMT-2010). Saint-Raphaël, France, May 27-28, 2010.

Ralf D. Brown, "Taming Structured Perceptrons on Wild Feature Vectors". In Proceedings of the Fifth Workshop on Statistical Machine Translation (WMT'10), Uppsala, Sweden, July 2010.

Ralf D. Brown, "The CMU-EBMT Machine Translation System". In Machine Translation, Volume 25 Number 2 (2011, special issue on Free/Open Source Machine Translation), pp. 179-195. ISSN 0922-6567.
DOI: 10.1007/s10590-011-9095-8
Abstract: This paper presents an in-depth description of the features of the open-source CMU-EBMT example-based machine translation system. CMU-EBMT is a complete end-to-end system including lexicon induction, word and phrase alignment, corpus indexing and lookup, language model, decoder, and parameter tuning components. While it does not require them, it can take advantage of external alignment information and other annotations provided by GIZA++ and other systems. To illustrate a recent addition to CMU-EBMT, experiments are presented which show an improvement of 0.16 BLEU points (0.9% relative) on a cross-validated small-data English-Haitian translation task when using a new set of fine-grained log-linear feature values representing language model match lengths in addition to language model probabilities.

Aaron B. Phillips and Ralf D. Brown. "Training Machine Translation with a Second-Order Taylor Approximation of Weighted Translation Instances." In Proceedings of the Thirteenth Machine Translation Summit (MT-Summit XIII), Xiamen, China, September 2011.
Available as PDF.

For more on EBMT, see the pages of the Generalized EBMT project. In addition, I gave a tutorial on EBMT at AMTA-2002, for which the slides are available in PDF format.

---- Multi-Engine Machine Translation ----

R. Frederking, S. Nirenburg, D. Farwell, S. Helmreich, E. Hovy, K. Knight, S. Beale, C. Domashnev, D. Attardo, D. Grannes, and R. Brown. "Integrating Translations from Multiple Sources within the Pangloss Mark III Machine Translation System", in Proceedings of the First Conference of the Association for Machine Translation in the Americas (AMTA-94). Columbia, Maryland 1994.

Robert Frederking and Ralf D. Brown, "The Pangloss-Lite Machine Translation System". In Expanding MT Horizons: Proceedings of the Second Conference of the Association for Machine Translation in the Americas, Montreal, Canada [photos], 1996. pp. 268-272.
Available in Scribe and PostScript format.
Abstract: The Pangloss-Lite (PanLite) machine translation system is a standalone C++ re-implementation of several major components from the Pangloss machine translation system. It incorporates the Pangloss Example-Based MT (EBMT) and Transfer-Based MT engines, and its statistical language modeller, as well as a newly-implemented morphological analyzer, within the multi-engine MT architecture developed during the course of the project.

---- Statistical/Probabilistic MT ----

Ralf D. Brown, Jae Dong Kim, Peter J. Jansen, and Jaime G. Carbonell, "Symmetric Probabilistic Alignment". In Building and Using Parallel Texts: Data-Driven Machine Translation and Beyond - Proceedings of the Workshop, pp. 87-90. Ann Arbor, Michigan, June 29-30, 2005. (CiteSeer doi:10.1.1.69.4943)
Available in PostScript format.
Abstract: In this short paper, we outline our basic alignment algorithm and some extensions for using context and positional information, and compare its alignment accuracy on the Romanian-English data for the shared task with IBM Model 4 and the reported results from the prior workshop.

Ralf Brown and Robert Frederking, "Applying Statistical English Language Modelling to Symbolic Machine Translation". In Proceedings of the Sixth International Conference on Theoretical and Methodological Issues in Machine Translation (TMI'95), pp. 221-239. Leuven, Belgium, July 5-7, 1995. (CiteSeer doi:10.1.1.124.7392)
Available in LaTeX and PostScript format.
Abstract: The PANGLOSS Mark III system was from the outset designed to be a symbolic, human-aided machine translation (MT) system. The need arose to rapidly adapt it for use as a fully-automated MT system. Our solution to this problem was to add a statistical English language model (ELM) to replace the most significant user activity, selecting between alternate translations produced by the system. The language model used is a trigram model with backoff to bigram and unigram probabilities. The language modeling and search procedure are described in detail, and comparison is made to other trigram-based statistical MT work.
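To make the idea of selecting between alternate translations with a trigram model concrete, here is a small Python sketch. It is not the paper's model: the abstract does not give the smoothing scheme, so this assumes a simplified fixed-penalty backoff (trigram, then bigram, then add-one unigram) that yields scores rather than normalized probabilities. The class name, the penalty value of 0.4, and the training sentences are all hypothetical.

```python
import math
from collections import Counter


class BackoffTrigramModel:
    """Toy trigram scorer with fixed-penalty backoff (an assumption, not the paper's method)."""

    def __init__(self, sentences, penalty=0.4):
        self.penalty = penalty
        self.unigrams, self.bigrams, self.trigrams = Counter(), Counter(), Counter()
        for sentence in sentences:
            words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
            self.unigrams.update(words)
            self.bigrams.update(zip(words, words[1:]))
            self.trigrams.update(zip(words, words[1:], words[2:]))
        self.total = sum(self.unigrams.values())

    def score(self, u, v, w):
        """Score word w given the two preceding words u and v."""
        if self.trigrams[(u, v, w)]:
            return self.trigrams[(u, v, w)] / self.bigrams[(u, v)]
        if self.bigrams[(v, w)]:
            # Back off to the bigram estimate, with a penalty.
            return self.penalty * self.bigrams[(v, w)] / self.unigrams[v]
        # Back off again to an add-one-smoothed unigram estimate.
        vocab = len(self.unigrams)
        return self.penalty ** 2 * (self.unigrams[w] + 1) / (self.total + vocab)

    def log_score(self, sentence):
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        return sum(math.log(self.score(u, v, w))
                   for u, v, w in zip(words, words[1:], words[2:]))


def pick_best(model, candidates):
    """Return the candidate translation with the highest language-model score."""
    return max(candidates, key=model.log_score)


model = BackoffTrigramModel(["the dog barks", "the cat sleeps", "a dog barks loudly"])
print(pick_best(model, ["dog the barks", "the dog barks"]))  # the dog barks
```

The candidates here have the same length; comparing outputs of different lengths would need some form of length normalization, which this sketch leaves out.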

Ralf D. Brown, "Automated Dictionary Extraction for 'Knowledge-Free' Example-Based Translation". In Proceedings of the Seventh International Conference on Theoretical and Methodological Issues in Machine Translation, pp. 111-118. Santa Fe, July 23-25, 1997.
Available in LaTeX and PostScript format.
Abstract: An Example-Based Machine Translation system is supplied with a sentence-aligned bilingual corpus, but no other knowledge sources. Using the knowledge implicit in the corpus, it generates a bilingual word-for-word dictionary for alignment during translation. With such an automatically-generated dictionary, the system covers (with equivalent quality) more of its input on unseen texts than the same system does when provided with a manually-created general-purpose dictionary and other knowledge sources.

Ralf D. Brown. "Automatically-Extracted Thesauri for Cross-Language IR: When Better is Worse", In Proceedings of the First Workshop on Computational Terminology (COMPUTERM'98), Montreal, Canada [photos], 15 August 1998, pp. 15-21. (CiteSeer doi:10.1.1.46.7633)
(Held in conjunction with COLING-ACL'98).
Available in PostScript format (7 pages).

Rashmi Gangadharaiah, Ralf Brown and Jaime Carbonell. "Monolingual Distributional Profiles for Word Substitution in Machine Translation", in Proceedings of COLING-2010, August 23-27, 2010, Beijing, China.
Abstract: Out-of-vocabulary (OOV) words present a significant challenge for Machine Translation. For low-resource languages, limited training data further increases the frequency of OOV words and degrades the quality of the translations. Past approaches have suggested using stems or synonyms for OOV words. Unlike the previous methods, we propose handling not just the OOV words but rare words as well in an Example-based Machine Translation (EBMT) paradigm. Presence of OOV words and rare words in the input sentence prevents the system from finding longer phrasal matches and produces low quality translations due to less reliable language model estimates. The proposed method requires only a monolingual corpus of the source language to find candidate replacements. A new framework is introduced to score and rank the replacements by efficiently combining features extracted for the candidate replacements. The lattice representation scheme allows the decoder to select from a beam of possible replacement candidates. The new framework gives statistically significant improvements in English-Chinese and English-Haitian translation systems.

---- Other Machine Translation Work ----

Kathy Baker, Steven Bethard, Michael Bloodgood, Ralf Brown, Chris Callison-Burch, Glen Coppersmith, Bonnie Dorr, Wes Filardo, Kendall Giles, Ann Irvine, Mike Kayser, Lori Levin, Justin Martineau, Jim Mayfield, Scott Miller, Aaron Phillips, Andrew Philpot, Christine Piatko, Lane Schwartz and David Zajic. Semantically Informed Machine Translation (SIMT), Summer Camp for Applied Language Exploration (SCALE) 2009 Summer Workshop Final Report. Tech report number 002 for the Human Language Technology Center Of Excellence (HLTCOE). PDF.

---- Knowledge Acquisition ----

Ralf D. Brown, "The MikroKARAT Distributed Knowledge Acquisition Environment", presented at the Knowledge Engineering on the Information Highway Workshop, September 1994.

Katharina Probst, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin, and Erik Peterson. "Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages", in Proceedings of the MT2010 workshop at MT Summit 2001. Santiago de Compostela, Spain, September 2001.
Available as PostScript and PDF.

Lori Levin, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Eliseo Cañulef, and Carolina Huenchullan. "Data Collection and Language Technologies for Mapudungun". In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC-2002). Las Palmas, Gran Canaria, Spain, May 2002.

Jaime Carbonell, Katharina Probst, Erik Peterson, Christian Monson, Alon Lavie, Ralf Brown, and Lori Levin. "Automatic Rule Learning for Resource-Limited MT". In Proceedings of the Fifth Conference of the Association for Machine Translation in the Americas (AMTA 2002), pp. 1-10. Tiburon, California, October 8-12, 2002.
Available in PostScript and PDF.
Abstract: Machine translation of minority languages presents unique challenges, including the paucity of bilingual training data and the unavailability of linguistically-trained speakers. This paper focuses on a machine learning approach to transfer-based MT, where data in the form of translations and lexical assignments are elicited from bilingual speakers, and a seeded version-space learning algorithm formulates and refines transfer rules.

Christian Monson, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjós, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, and Rosendo Huisca. "Data Collection and Analysis of Mapudungun Morphology for Spelling Correction". In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC-2004).

---- Speech-to-Speech Translation ----

Alan W Black, Ralf D. Brown, Robert Frederking, Rita Singh, John Moody, and Eric Steinbrecher, "TONGUES: Rapid Development of a Speech-to-Speech Translation System". In Proceedings of HLT 2002: Second International Conference on Human Language Technology Research, ed. Mitchell Marcus. San Diego, California, March 24-27, 2002, pp. 183-186. (CiteSeer doi:10.1.1.68.9513)
Draft version distributed to conference participants available in PostScript and PDF (4 pages), or read the final version online here.
Abstract: We carried out a one-year project to build a portable speech-to-speech translation system in a new language that could run on a small portable computer. Croatian was chosen as the target language. The resulting system was tested with real users on a trip to Croatia in the spring of 2001. We describe its basic components, the methods we used to build them, initial evaluation results, and related significant observations. This work was done in conjunction with the US Army Chaplain School; chaplains are often the only personnel in a position to communicate with local people over non-military issues such as medical supplies, refugees, etc. This paper thus reports on a realistic instance of rapidly deploying and field-testing a speech-to-speech translator using current technology.

Robert E. Frederking, Alan W Black, Ralf D. Brown, John Moody, and Eric Steinbrecher. "Field Testing the Tongues Speech-to-Speech Machine Translation System". In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC-2002), pp. 160-164. Las Palmas, Gran Canaria, Spain, May 2002.

Robert Frederking, Alan W Black, Ralf Brown, Alexander Rudnicky, John Moody, and Eric Steinbrecher. "Speech Translation on a Tight Budget Without Enough Data". In ACL-02 Workshop on Speech-to-Speech Translation: Algorithms and Systems. Philadelphia, Pennsylvania, July 2002.

Alan W Black, Ralf Brown, Robert Frederking, Kevin Lenzo, John Moody, Alexander Rudnicky, Rita Singh, and Eric Steinbrecher. "Rapid Development of Speech-to-Speech Translation Systems." In Proceedings of ICSLP-2002. Denver, 2002.
Available in PDF.

---- Miscellaneous ----

Ralf D. Brown, "Improving Embedded Machine Translation with User Interaction", In Proceedings of the 1998 AMTA Workshop on Embedded Machine Translation, Langhorne, Pennsylvania, 28 October 1998. (CiteSeer doi:10.1.1.46.7485)
Available as GZIPped PostScript (due to the large size -- four megs -- of embedded bitmaps).
Abstract: Machine translation (MT) of texts is known to be error-prone, but how can one recover from errors when the translation program is embedded in a larger system in which the user may have no direct access to the translator? One approach is to present the user with various alternative translations from which to select the correct one when the MT program decides on a different, incorrect alternative.

Katharina Probst and Ralf D. Brown, "Using Similarity Scoring to Improve the Bilingual Dictionary for Word Alignment". In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02), pp. 409-416. Philadelphia, Pennsylvania, July 7-10, 2002.
Available in PostScript and PDF.
Abstract: We describe an approach to improve the bilingual cooccurrence dictionary that is used for word alignment, and evaluate the improved dictionary using a version of the Competitive Linking algorithm. We demonstrate a problem faced by the Competitive Linking algorithm and present an approach to ameliorate it. In particular, we rebuild the bilingual dictionary by clustering similar words in a language and assigning them a higher cooccurrence score with a given word in the other language than each single word would have otherwise. Experimental results show a significant improvement in precision and recall for word alignment when the improved dictionary is used.

Violetta Cavalli-Sforza, Ralf D. Brown, Jaime G. Carbonell, Peter J. Jansen, and Jae Dong Kim. "Challenges in Using an Example-Based MT System for a Transnational Digital Government Project". In Proceedings of the Ninth Workshop of the European Association for Machine Translation (EAMT-04), pp. 33-42. University of Malta, April 26-27, 2004.

---- Tools ----

R.D. Brown, FramepaC User's Reference, in preparation. Current draft available in Scribe (380KB) and PostScript (146 pages) formats.

Information Retrieval

Jaime Carbonell, Yiming Yang, Robert Frederking, Ralf D. Brown, Yibing Geng, and Danny Lee. "Translingual Information Retrieval: A Comparative Evaluation". In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97), Vol I, pp. 708-715. Nagoya, Japan. 23-29 August 1997. (CiteSeer doi:10.1.1.173.11)
IJCAI-97 Distinguished Paper Award
Available in LaTeX and PostScript format.
Abstract: Translingual information retrieval (TIR) consists of providing a query in one language and searching document collections in one or more different languages. This paper introduces new TIR methods and reports on comparative TIR experiments with these new methods and with previously reported ones in a realistic setting. Methods fall into two categories: query translation based, and statistical-IR approaches establishing translingual associations. The results show that using bilingual corpora for automated extraction of term equivalences in context outperforms other methods. Translingual versions of the Generalized Vector Space Model (GVSM) and Latent Semantic Indexing (LSI) perform relatively well, as does translingual pseudo relevance feedback (PRF). All showed relatively small performance loss between monolingual and translingual versions. Query translation based on a general machine-readable bilingual dictionary -- heretofore the most popular method -- did not match the performance of other, more sophisticated methods. Also, the previous very high LSI results in the literature were disconfirmed by more realistic relevance-based evaluations.

Yiming Yang, Ralf D. Brown, Robert Frederking, Jaime Carbonell, Yibing Geng and Daniel Lee. "Bilingual-corpus Based Approaches to Translingual Information Retrieval". 2nd Workshop on Multilinguality in Software Industry: The AI Contribution (MULSAIC'97). Nagoya, Japan, August 25, 1997.

R.D. Brown, "Corpus-Based Query Translation for Translingual Information Retrieval". Position paper for SIGIR-97 workshop on Cross-Lingual Information Retrieval (Philadelphia, 31 July 1997).
Available in LaTeX and PostScript format. Overhead transparencies from the workshop are also available in Scribe and PostScript format.

Yiming Yang, Jaime G. Carbonell, Ralf D. Brown, and Robert E. Frederking. "Translingual Information Retrieval: Learning from Bilingual Corpora". In Artificial Intelligence (Special issue: Best of IJCAI-97), Vol. 103 (1998), pp. 323-345. (CiteSeer doi:10.1.1.42.8746)
Available in PostScript format (23 pages).

Ralf D. Brown, "Corpus-Driven Splitting of Compound Words", In Proceedings of the Ninth International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2002). Keihanna, Japan [photos], March 2002. (CiteSeer doi:10.1.1.13.3087)
Available in PostScript, PDF, and LaTeX formats (10 pages).
Abstract: This paper presents a method for splitting compound words into their constituents based on cognate words in the other language of a parallel corpus. A minor extension to the method allows the decompounding of words which do not have cognates in the other language. By decompounding the training corpus for an Example-Based MT system, the incidence of word alignment failure can be substantially reduced, yielding a modest improvement in performance.

Language Identification

Ralf D. Brown, "Selecting and Weighting N-Grams to Identify 1100 Languages", In Proceedings of Text, Speech, and Dialogue 2013. Plzen, Czech Republic, September 2013.
Available in PDF. Slides from my presentation at TSD.
Abstract: This paper presents a language identification algorithm using cosine similarity against a filtered and weighted subset of the most frequent n-grams in training data with optional inter-string score smoothing, and its implementation in an open-source program. When applied to a collection of strings in 1100 languages containing at most 65 characters each, an average classification accuracy of over 99.2% is achieved with smoothing and 98.2% without. Compared to three other open-source language identification programs, the new program is both much more accurate and much faster at classifying short strings given such a large collection of languages.
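
The core idea in the abstract above (cosine similarity between a string's n-grams and a per-language model built from the most frequent n-grams) can be illustrated with a minimal sketch. This is not the paper's implementation: the filtering, weighting, and inter-string smoothing steps are omitted, and the function names, the two tiny training strings, and the cutoff `k` are all assumptions made for illustration.

```python
from collections import Counter
import math

def ngram_counts(text, n_max=3):
    """Count all character n-grams of length 1..n_max in text."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def top_k_model(text, k=300):
    """Keep only the k most frequent n-grams (k is an illustrative choice)."""
    return dict(ngram_counts(text).most_common(k))

def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(v * b.get(g, 0) for g, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def identify(text, models):
    """Return the language whose model is most similar to the text."""
    query = ngram_counts(text)
    return max(models, key=lambda lang: cosine(query, models[lang]))

# Hypothetical toy training data; a real model needs far more text per language.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze sitzt auf der matte",
}
MODELS = {lang: top_k_model(text) for lang, text in TRAINING.items()}

print(identify("the dog and the cat", MODELS))
print(identify("der hund und die katze", MODELS))
```

With only two languages and a sentence of training text each, the sketch shows the scoring mechanism and nothing more; it says nothing about accuracy at the scale the paper reports.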

Ralf D. Brown, "Non-linear Mapping for Improved Identification of 1300+ Languages." In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2014).
(Preprint available on request)
Abstract: Non-linear mappings of the form P(ngram)^γ and log(1+τ P(ngram))/log(1+τ) are applied to the n-gram probabilities in five trainable open-source language identifiers. The first mapping reduces classification errors by 4.0% to 83.9% over a test set of more than one million 65-character strings in 1366 languages, and by 2.6% to 76.7% over a subset of 781 languages. The second mapping improves four of the five identifiers by 10.6% to 83.8% on the larger corpus and 14.4% to 76.7% on the smaller corpus. The subset corpus and the modified programs are made freely available for download at http://www.cs.cmu.edu/~ralf/langid.html.
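
The two mappings named in the abstract above are simple to state in code. The sketch below only evaluates the formulas as written; the parameter values `gamma=0.5` and `tau=10.0` are illustrative assumptions, not values taken from the paper, and the function names are hypothetical.

```python
import math

def gamma_map(p, gamma):
    """Power mapping P(ngram)^gamma; gamma < 1 boosts small probabilities."""
    return p ** gamma

def log_map(p, tau):
    """Logarithmic mapping log(1 + tau*P) / log(1 + tau); requires tau > 0."""
    return math.log(1 + tau * p) / math.log(1 + tau)

# Both mappings fix the endpoints 0 and 1 and reshape the values in between.
for p in (0.0, 0.001, 0.01, 0.25, 1.0):
    print(p, gamma_map(p, 0.5), log_map(p, 10.0))
```

For `gamma` below 1 and any positive `tau`, both functions raise small probabilities relative to large ones, which reduces the dominance of the few most frequent n-grams in a similarity score.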

Topic Detection and Tracking

James Allan, Jaime Carbonell, George Doddington, Jonathan Yamron, Yiming Yang, Brian Archibald, Doug Beeferman, Adam Berger, Ralf Brown, Ira Carp, Alex Hauptmann, John Lafferty, Victor Lavrenko, Ron Papka, Thomas Pierce, Jay Ponte, and Mike Scudder. "Topic Detection and Tracking Pilot Study Final Report" (1998). (CiteSeer doi:10.1.1.27.1159)

Jaime G. Carbonell, Yiming Yang, John Lafferty, Ralf Brown, Tom Pierce, and Xin Liu. "CMU Report on TDT-2: Segmentation, Detection, and Tracking", in Proceedings of the 1999 DARPA Broadcast News Conference.
Available in HTML and PostScript.

Yiming Yang, Jaime G. Carbonell, Ralf D. Brown, Thomas Pierce, Brian T. Archibald, and Xin Liu. "Learning Approaches for Detecting and Tracking News Events". In IEEE Intelligent Systems, Volume 14, Number 4, pp 32-43.

Ralf D. Brown, Thomas Pierce, Yiming Yang, and Jaime G. Carbonell. "Link Detection - Results and Analysis", TDT-1999 workshop.
Available in HTML, PostScript, and LaTeX.

Ralf D. Brown, "A Server for Real-Time Event Tracking in News". In Proceedings of HLT 2001: First International Conference on Human Language Technology Research, pp. 325-327. San Diego, California, March 18-21, 2001.
Available in PostScript.

Ralf D. Brown, "Dynamic Stopwording for Story Link Detection". In Proceedings of HLT 2002: Second International Conference on Human Language Technology Research, ed. Mitchell Marcus. San Diego, California, March 24-27, 2002, pp. 190-193. (CiteSeer doi:10.1.1.16.4523)
Draft version distributed to conference participants is available in GZipped PostScript (note -- 5.8 megs after decompressing, due to bitmapped graphs).
Abstract: Carnegie Mellon University entered two systems in the Story Link Detection track of the 2001 Topic Detection and Tracking (TDT) evaluation. These systems were one of our systems from the 1999 TDT evaluation, retuned for the new corpus, which had the third-best cost measure; and a new system that adds clustering and dynamically-generated stopwording, which had the best cost measure among all submissions for the default evaluation condition. This paper describes the enhancements which were made and some which were attempted but not used in the evaluation.

Digital Forensics

Ralf D. Brown. "Reconstructing corrupt DEFLATEd files". In Digital Investigation, Volume 8 (2011), pp. S125-S131. (Proceedings of the Eleventh Annual DFRWS Conference, New Orleans, August 1-3, 2011)
DOI: 10.1016/j.diin.2011.05.015
Available in PDF.
Also available: slides from my presentation at DFRWS.
Abstract: We present a method by which to determine a synchronization point within a DEFLATE-compressed bit stream (as used in Zip and gzip archives) for which the beginning is unknown or damaged. Decompressing from the synchronization point forward yields a mixed stream of literal bytes and co-indexed unknown bytes. Language modeling in the form of byte trigrams and word unigrams is then applied to the resulting stream to infer probable replacements for each co-indexed unknown byte. Unique inferences can be made for approximately 30% of the co-indices, permitting reconstruction of approximately 75% of the unknown bytes recovered from the compressed data with accuracy in excess of 90%. The program implementing these techniques is available as open-source software.

Ralf D. Brown. "Finding and Identifying Text in 900+ Languages". In Digital Investigation, Volume 9 (2012), pp. S34-S43. (Proceedings of the Twelfth Annual DFRWS Conference, Washington DC, August 6-8, 2012)
DOI: 10.1016/j.diin.2012.05.004
Available in PDF.
Also available: slides from my presentation at DFRWS.
Abstract: This paper presents a trainable open-source utility to extract text from arbitrary data files and disk images which uses language models to automatically detect character encodings prior to extracting strings and for automatic language identification and filtering of non-textual strings after extraction. With a test set containing 923 languages, consisting of strings of at most 65 characters, an overall language identification error rate of less than 0.4% is achieved. False alarm rates on random data are 0.34% when filtering thresholds are set for high recall and 0.012% when set for high precision, with corresponding miss rates of 0.002% and 0.009% in running text.

Ralf D. Brown. "Improved recovery and reconstruction of DEFLATEd files". In Digital Investigation, Volume 10 (2013), pp. S21-S29. (Proceedings of the Thirteenth Annual DFRWS Conference, Monterey, CA, August 5-7, 2013)
DOI:
Available in PDF.
Abstract: This paper presents a method for recovering data from files compressed with the DEFLATE algorithm where short segments in the middle of the file have been corrupted, yielding a mix of literal bytes, bytes aligned with literals across the corrupted segment, and co-indexed unknown bytes. An improved reconstruction algorithm based on long byte n-grams increases the proportion of reconstructed bytes by an average of 8.9% absolute across the 21 languages of the Europarl corpus compared to previously-published work, and the proportion of unknown bytes correctly reconstructed by an average of 20.9% absolute, while running in one-twelfth the time on average. Combined with the new recovery method, corrupted segments of 128 to 4096 bytes in the compressed bit-stream result in reconstructed output which differs from the original file by an average of less than twice the number of bytes represented by the corrupted segment. Both new algorithms are implemented in the trainable open-source ZipRec utility program.

Personal Computers

A. Schulman, R.J. Michels, J. Kyle, T. Paterson, D. Maxey, and R. Brown. Undocumented DOS: A Programmer's Guide to Reserved MS-DOS Functions and Data Structures. Addison-Wesley, 1990, 694+xviii pp. ISBN 0-201-57064-5.
(Chinese translation: ISBN 7-302-01071-4; Japanese translation: ISBN 4-89052-629-3)
Errata are available.

Ralf Brown and Jim Kyle. PC Interrupts: A Programmer's Reference to BIOS, DOS, and Third-Party Calls. Addison-Wesley, 1991, 1024 pp. ISBN 0-201-57797-6.
(Chinese translation: ISBNs 957-652-272-2, 957-652-271-4, and 957-652-261-7; Russian translation: ISBNs 5-03-002989-3 and 5-03-002990-7)
Errata are available.

A. Schulman, R. Brown, D. Maxey, R.J. Michels, and J. Kyle. Undocumented DOS: A Programmer's Guide to Reserved MS-DOS Functions and Data Structures, 2nd ed. Addison-Wesley, 1993. ISBN 0-201-63287-X.
Errata are available.

Ralf Brown and Jim Kyle. PC Interrupts: A Programmer's Reference to BIOS, DOS, and Third-Party Calls, 2nd ed. Addison-Wesley, 1994. ISBN 0-201-62485-0.
Errata are available.

Ralf Brown and Jim Kyle. Network Interrupts: A Programmer's Reference to Networking Calls. Addison-Wesley, 1994. ISBN 0-201-62644-6.
Errata are available.

Ralf Brown and Jim Kyle. Uninterrupted Interrupts: A Programmer's CD-ROM Reference to Network APIs and to BIOS, DOS, and Third-Party Calls. Addison-Wesley, 1994. ISBN 0-201-40966-6.

Ralf Brown, "QPI: The QEMM-386 Programming Interface". In Dr. Dobb's Journal, July 1994, pp. 123-131. Miller-Freeman, Inc., San Mateo, California. ISSN 0-38351-16562-8-07.

Ralf Brown. "A Swapping Replacement for the spawn() Family." In D. Burki and R. Ward, eds., MS-DOS System Programming, Third Edition. R&D Publications, 1994. ISBN 0-13-207382-X.

Ralf Brown, "Pentium Model-Specific Registers and What They Reveal". A binary of the High MSR display program (114 bytes) is also available.

Mathematics and Cryptography

Ohoe Kim, John Chollet, Ralf Brown, and David Rauschenberg, "Orthonormal Bases of Symmetry Classes with Computer-Generated Examples", Linear and Multilinear Algebra, 1987, vol 21, pp. 91-106.

A pair of Usenet articles posted on 29apr95: Generalized Feistel Networks and Sliding Feistel Networks.

Miscellaneous

Jim Kyle and R. Brown, eds. Software Developer's Internet Directory. IDG Books, 1996. ISBN 1-56884-821-8.
Out of Print
Errata are available.

(Last updated 2-Sep-2014)