Alon's Home Page
Teaching
I'm the main instructor of the Algorithms for NLP (11-711) course at the LTI.
Algorithms for NLP is an introductory graduate-level course on the
computational properties of natural languages and the fundamental algorithms
for processing them. The course provides an in-depth presentation of the major
algorithms used in NLP, including lexical, morphological, syntactic and
semantic analysis, with the primary focus on parsing algorithms and their
analysis.
I am also a co-instructor of the Introduction to Human Language Technologies
(11-682) course, the Machine Translation (11-731) course, and the Grammar
Formalisms (11-722) course.
I supervise the NLP Lab (11-712) course and co-instruct the MT Lab (11-732)
course.
Research
My main areas of research are Machine Translation (MT) of both text and
speech, and Spoken Language Understanding (SLU). My most active current
research is on developing a general framework for syntax-driven Machine
Translation, applicable to a variety of data scenarios. This framework is
being applied to developing MT prototype systems for languages with limited
amounts of electronic resources. It is also being applied to data-rich
scenarios. One main focus of this work is the development of novel
syntax-based methods for acquisition of the resources that are necessary for
MT. I am also actively working on frameworks for Multi-Engine Machine
Translation (MEMT) and on developing automatic metrics for MT evaluation.
Another current research project is developing parsing approaches for accurate
annotation of Grammatical Relations (GRs) in spoken language data. I have
worked extensively on the design and development of Speech-to-Speech Machine
Translation systems and on robust parsing algorithms for analysis of spoken
language.
Current Research Projects
The AVENUE and LETRAS Projects:
I am a co-PI of the AVENUE and LETRAS projects (funded by NSF). AVENUE is
concerned with the design and rapid development of new Machine Translation
methods for languages for which only scarce resources are available. Our goal
in AVENUE is to apply these new MT methods to minority languages, with a
specific focus on native languages of North and Latin America. We worked on
developing MT systems between Spanish and Mapudungun, a native language spoken
in southern Chile, and have started working on Quechua, a native language
spoken mainly in Peru, Ecuador and Bolivia. The LETRAS project is a follow-on
project to AVENUE, where we are focusing on further development of the
underlying general MT framework and expanding its application to new
languages, including Inupiaq (a native Alaskan language), and native languages
in Bolivia and Brazil. Together with Jaime Carbonell, Lori Levin, and a team
of several graduate students, I am working on the following primary research
topics: the design and implementation of a transfer-based MT framework
specifically suitable for learning from data and for rapid prototyping of MT
systems (work with Erik Peterson); automatic learning of MT transfer rules for
languages with limited amounts of data resources (work with Kathrin Probst);
automatic rule refinement based on feedback from users (work with Ariadna
Font-Llitjos); and unsupervised learning of morphological inflection classes
from monolingual data (work with Christian Monson).
Select Publications:
The Hebrew-English MT Project:
As a direct follow-up to our AVENUE project work and in collaboration with
Shuly Wintner and his Computational Linguistics Group at the University of
Haifa in Israel, we are developing a prototype Hebrew-to-English Machine
Translation system that is based on the framework developed under AVENUE. This
work is being supported by a small grant from the Caesaria Rothschild
Institute at the University of Haifa.
Select Publications:
The MEMT Project:
I am the lead PI of a project on a new approach to Multi-Engine Machine
Translation (MEMT). The goal of MEMT is to synthesize the output of multiple
MT systems into a new output that is more accurate than the output of any of
the contributing systems. The new approach involves two main stages. An
explicit word matcher is first used to identify the words that are common
between the MT engine outputs. A decoding algorithm then uses this
information, in conjunction with confidence estimates for the various engines
and a language model, to score and rank a collection of sentence
hypotheses that are synthetic combinations of words from the various original
engines. The highest scoring sentence hypothesis is selected as the final
output of our system. The project is currently being funded by the DARPA GALE
program, where our MEMT system serves as an essential component for combining
the output from multiple MT engines within the Interoperability Demonstration
system (IOD). The MEMT system has been made available for experimentation
to other research groups. Contact me by email to obtain a copy.
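As a rough illustration of the two stages described above, the following toy Python sketch first collects the words shared between engine outputs and then scores and ranks combined hypotheses. It is not the actual MEMT system: the function names (`match_words`, `score`, `combine`, `toy_lm`), the additive scoring formula, the hand-made bigram "language model", and the assumption that all engine outputs have the same length are hypothetical simplifications introduced only for this example.

```python
from itertools import product


def match_words(outputs):
    """Stage 1 (toy): words that appear in at least two engine outputs."""
    counts = {}
    for tokens in outputs:
        for word in set(tokens):
            counts[word] = counts.get(word, 0) + 1
    return {word for word, n in counts.items() if n >= 2}


def score(hyp, outputs, confidences, common, lm_score):
    """Stage 2 (toy): confidence-weighted support + agreement + LM score."""
    support = 0.0
    for tokens, conf in zip(outputs, confidences):
        vocab = set(tokens)
        support += conf * sum(1 for word in hyp if word in vocab)
    agreement = sum(1 for word in hyp if word in common)
    return support + agreement + lm_score(hyp)


def combine(outputs, confidences, lm_score):
    """Rank position-wise word combinations and return the best one.

    Assumes all outputs have the same length (a simplification).
    """
    common = match_words(outputs)
    # One slot per position; each slot offers the distinct words the
    # engines produced there (dict.fromkeys keeps first-seen order).
    slots = [list(dict.fromkeys(words)) for words in zip(*outputs)]
    hypotheses = [list(h) for h in product(*slots)]
    return max(
        hypotheses,
        key=lambda h: score(h, outputs, confidences, common, lm_score),
    )


def toy_lm(hyp):
    """Stand-in language model: counts bigrams from a tiny hand-made set."""
    known = {("the", "cat"), ("cat", "sat"), ("sat", "down")}
    return sum(1 for bigram in zip(hyp, hyp[1:]) if bigram in known)


outputs = [
    "the cat sat down".split(),
    "a cat sat down".split(),
    "the cat sits down".split(),
]
best = combine(outputs, [0.5, 0.3, 0.2], toy_lm)
print(" ".join(best))
```

Enumerating every position-wise combination is only feasible for very short toy inputs; it stands in here for the decoding algorithm, whose details are not given on this page.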
Select Publications:
The METEOR Project:
METEOR is an automatic metric for MT evaluation that we have been
developing at CMU for the past couple of years. METEOR is designed to
address a number of weaknesses in the widely used BLEU and NIST metrics. The
metric relies heavily on an algorithm for finding an optimal word-to-word
matching between a candidate MT translation and a human-produced reference
translation for the same input sentence. METEOR produces normalized scores
(in the range [0,1]), and has been demonstrated to have significantly higher
levels of correlation with human judgments of MT quality than BLEU and NIST.
METEOR is freely available for download.
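To make the idea of a matching-based, normalized score concrete, here is a minimal Python sketch of a METEOR-like score. It is not the released METEOR implementation: it matches exact word forms only, uses a greedy alignment rather than an optimal one, and the function names and the default parameter values (`recall_weight`, `penalty_weight`, `penalty_exp`) are assumptions chosen for illustration.

```python
def align_exact(candidate, reference):
    """Greedy one-to-one alignment of identical words.

    Returns (candidate_index, reference_index) pairs in candidate order.
    A real matcher would search for an optimal alignment instead.
    """
    used = set()
    pairs = []
    for i, word in enumerate(candidate):
        for j, ref_word in enumerate(reference):
            if j not in used and word == ref_word:
                used.add(j)
                pairs.append((i, j))
                break
    return pairs


def count_chunks(pairs):
    """Count maximal runs of matches that are adjacent in both sentences."""
    chunks = 0
    prev = None
    for i, j in pairs:
        if prev is None or i != prev[0] + 1 or j != prev[1] + 1:
            chunks += 1
        prev = (i, j)
    return chunks


def meteor_like_score(candidate, reference,
                      recall_weight=9.0, penalty_weight=0.5, penalty_exp=3.0):
    """Recall-weighted harmonic mean of unigram precision and recall,
    discounted by a fragmentation penalty. The result lies in [0, 1]."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    pairs = align_exact(cand, ref)
    matches = len(pairs)
    if matches == 0:
        return 0.0
    precision = matches / len(cand)
    recall = matches / len(ref)
    fmean = ((1 + recall_weight) * precision * recall
             / (recall + recall_weight * precision))
    penalty = penalty_weight * (count_chunks(pairs) / matches) ** penalty_exp
    return fmean * (1 - penalty)


print(meteor_like_score("the cat sat", "the cat sat on the mat"))
```

Weighting recall more heavily than precision and penalizing fragmented matches are design choices of this sketch; candidates that cover more of the reference in longer contiguous runs score higher.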
Select Publications:
The GRASP Project:
I am the PI of the GRASP Project (funded by NSF), where I am working together
with Brian MacWhinney (co-PI) and Kenji Sagae on developing a framework for
robust high-accuracy parsing of grammatical relations in spoken language data.
Our goal is to automatically annotate the CHILDES database (a large database
of child-parent conversations) with grammatical relations, in order to support
advanced corpus-based research on child language acquisition.
Select Publications:
Previous Research Projects
I was a co-PI of the Nespole! and C-STAR speech translation projects and of
the LingWear and Babylon mobile speech translation projects.
I was the lead PI of the AMTEXT project (2003-2005, funded by DoD), a small
pilot project that investigated the feasibility of a rapid development
approach to Machine Translation based on Information Extraction. The approach
builds upon the MT transfer framework developed in the AVENUE project and on
Fei Huang's work on translation of Named Entities. The main idea is to use a
small elicitation corpus of translated and word-aligned sentences to
semi-automatically learn pattern transfer rules that can then be used both to
extract the information of interest in the source language and to translate
this information into the target language.
I was a co-PI of the Clarity project (1997-1999, funded by DoD) on the
automatic detection and classification of the discourse structure of spoken
language.
Other Research Interests
I have a general interest in parsing algorithms for natural and programming
languages and in theoretical problems related to parsing. My own research
has primarily focused on the area of robust analysis and understanding of
spoken language. In my PhD work, I developed GLR*, one of the first robust
parsers for spoken language analysis, and a key component in the earlier
versions of the JANUS speech translation system.
My Students
My Students that have Graduated
Recent Talks and Presentations
Miscellaneous Information
Contact Information
Office:
4615 Newell-Simon Hall
+1-412-268-5655
Fax: +1-412-268-6298
Administrative Assistant:
Mary Jo Bensasi
4527 Newell-Simon Hall
maryjob AT cs DOT cmu DOT edu
+1-412-268-7517
Mailing Address:
Dr. Alon Lavie
Language Technologies Institute
4502 Newell-Simon Hall
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213-3891
Email:
alavie AT cs DOT cmu DOT edu (anti-spam notation)
Home:
5124 Beeler St.
Pittsburgh, PA 15217
+1-412-621-0933