The architecture of the AGIR system follows the general architecture
of an MT system that uses an interlingua strategy. Translation is
carried out in two stages: (1) from the source language to the
interlingua, and (2) from the interlingua into the target
language. Modules for analysis are independent from modules for
generation. Although our present work has only studied
Spanish and English, our approach can be easily
extended to other languages, and hence to a multilingual system, in the
sense that any analysis module can be linked to any generation
module.
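The independence of analysis and generation can be illustrated with a minimal sketch. It is not AGIR code: the interlingua is reduced to a list of concept labels, and the toy lexicon, the function names, and the word-by-word mapping are all assumptions made for illustration only.

```python
from typing import Callable, Dict, List

# Assumption: the interlingua is reduced to a list of language-neutral
# concept labels. AGIR's real interlingua is far richer.
Interlingua = List[str]

# Hypothetical toy lexicon: word -> concept label, per language.
TOY_LEXICON: Dict[str, Dict[str, str]] = {
    "es": {"perro": "DOG", "gato": "CAT"},
    "en": {"dog": "DOG", "cat": "CAT"},
}


def make_analyzer(lang: str) -> Callable[[str], Interlingua]:
    """Build an analysis module: source text -> interlingua."""
    lexicon = TOY_LEXICON[lang]

    def analyze(text: str) -> Interlingua:
        return [lexicon[word] for word in text.lower().split()]

    return analyze


def make_generator(lang: str) -> Callable[[Interlingua], str]:
    """Build a generation module: interlingua -> target text."""
    inverse = {concept: word for word, concept in TOY_LEXICON[lang].items()}

    def generate(interlingua: Interlingua) -> str:
        return " ".join(inverse[concept] for concept in interlingua)

    return generate


# One analyzer and one generator per language; neither knows the other.
ANALYZERS = {lang: make_analyzer(lang) for lang in TOY_LEXICON}
GENERATORS = {lang: make_generator(lang) for lang in TOY_LEXICON}


def translate(text: str, source: str, target: str) -> str:
    """Link any analysis module to any generation module."""
    return GENERATORS[target](ANALYZERS[source](text))
```

In this sketch, adding a language means adding one analyzer and one generator; existing modules are left untouched.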
In AGIR the analysis is carried out using SUPAR (slot
unification parser for anaphora resolution) [Ferrández et al., 1999].
SUPAR is a computational system that focuses on anaphora
resolution. It can deal with several kinds of anaphora, such as
pronominal anaphora, one-anaphora, or definite
descriptions. SUPAR's input is a grammar defined in the
grammatical formalism SUG (slot unification grammar). A
translator that transforms SUG rules into Prolog clauses has been
developed. This translator provides a Prolog program that will
parse each sentence. SUPAR can perform either a full or a partial
parsing of the text with the same parser and grammar. In this
study, partial-parsing techniques have been utilized due to the
unavoidable incompleteness of the grammar and the use of
unrestricted texts (corpora) as input.
The analysis of the source text is carried out in several steps.
The first step of the analysis module is the lexical and
morphological analysis of the input text. Because the input
consists of unrestricted texts, the system obtains the lexical and
morphological information for each lexical unit from the
output of a part-of-speech (POS) tagger. For each lexical unit,
the tagger supplies the word as it appears in the corpus, its
lemma, and its POS tag (with morphological information).
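As a minimal sketch, the three pieces of information supplied for each lexical unit can be held in a small record. The tab-separated input format, the tag strings, and the names `LexicalUnit` and `read_tagged` are assumptions for illustration; the actual output format of the tagger is not specified here.

```python
from typing import Iterable, List, NamedTuple


class LexicalUnit(NamedTuple):
    word: str     # the word as it appears in the corpus
    lemma: str    # its lemma
    pos_tag: str  # its POS tag, including morphological information


def read_tagged(lines: Iterable[str]) -> List[LexicalUnit]:
    """Parse tagger output, assuming one 'word<TAB>lemma<TAB>tag' per line."""
    units = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, lemma, tag = line.split("\t")
        units.append(LexicalUnit(word, lemma, tag))
    return units


# Hypothetical tagger output for the Spanish phrase "Los perros".
sample = ["Los\tel\tDET.m.pl", "", "perros\tperro\tNOUN.m.pl"]
units = read_tagged(sample)
```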
The next step is the parsing of the text (which includes the
lexical and morphological information extracted in the
previous stage). Before parsing, the text is split
into sentences. The output is the slot structure (SS), which
stores the information needed by the subsequent stages.
In the third step, a word-sense disambiguation (WSD) module is
used to obtain a single sense for each lexical unit of the
text. The lexical resources WordNet [Miller et al., 1990] and
EuroWordNet [Vossen, 1998] have been used in this
stage.
The SS, enriched with the information from the previous steps, will be
the input for the next step, in which NLP problems (anaphora,
extraposition, ellipsis, etc.) will be treated and solved. In
this work, we have focused on the resolution of NLP problems
related to pronominal anaphora. After this step, a new slot
structure (SS') is obtained. For each anaphoric expression, this
new structure stores the correct antecedent, chosen from the
possible candidates by a method based on constraints and
preferences [Ferrández et al., 1999], along with its
morphological and semantic information. The new structure SS'
will be the input for the final step of the analysis module.
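The general shape of a constraints-and-preferences method can be sketched as follows. This is not the method of [Ferrández et al., 1999]: the single agreement constraint, the proximity-based preference score, and all names are assumptions chosen only to show how constraints discard candidates and preferences rank the remaining ones.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Mention:
    text: str
    gender: str    # "m", "f" or "n"
    number: str    # "sg" or "pl"
    sentence: int  # index of the sentence containing the mention
    position: int  # order of appearance in the text


def agrees(pronoun: Mention, candidate: Mention) -> bool:
    """Constraint (assumed): gender and number agreement."""
    return (pronoun.gender == candidate.gender
            and pronoun.number == candidate.number)


def preference_score(pronoun: Mention, candidate: Mention) -> int:
    """Preference (assumed): favour candidates in nearby sentences."""
    if candidate.sentence == pronoun.sentence:
        return 2
    if candidate.sentence == pronoun.sentence - 1:
        return 1
    return 0


def resolve(pronoun: Mention, candidates: List[Mention]) -> Optional[Mention]:
    """Discard candidates with constraints, then rank with preferences."""
    viable = [c for c in candidates
              if c.position < pronoun.position and agrees(pronoun, c)]
    if not viable:
        return None
    # Ties on the preference score are broken in favour of the closest candidate.
    return max(viable, key=lambda c: (preference_score(pronoun, c), c.position))


mary = Mention("Mary", "f", "sg", sentence=0, position=0)
books = Mention("the books", "n", "pl", sentence=0, position=1)
john = Mention("John", "m", "sg", sentence=0, position=2)
she = Mention("she", "f", "sg", sentence=1, position=3)
antecedent = resolve(she, [mary, books, john])
```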
In the last step, AGIR generates the interlingua representation
of the entire text. This is the main difference between AGIR and
other MT systems, which process the input
text sentence by sentence. The interlingua representation will
allow the correct translation of the intersentential and
intrasentential pronominal anaphora into the target
language. Moreover, AGIR allows the identification of co-reference
chains of the text and their subsequent translation into the
target language.
The interlingua representation of the input text is based on the
clause as the main unit of this representation. Once the text has
been split into clauses, AGIR uses a complex feature structure for
each clause. This structure is composed of semantic roles and
features extracted from the SS of the clause. The notation we
have used is based on the one used in the KANT interlingua.
Note that AGIR represents an interlingua lexical unit by the
word together with its correct sense in WordNet. By accessing
the ILI (inter-lingual-index) module of EuroWordNet, the lexical
unit can then be generated in the target language.
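The lookup can be pictured as two table accesses, from a (word, sense) pair to an index record and from that record to a target-language word. The sketch below does not use the EuroWordNet software or data: both tables, the sense numbers, and the `ili-…` identifiers are made up for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical tables; the identifiers and sense numbers are invented.
# (source word, sense number) -> inter-lingual-index record
SOURCE_TO_ILI: Dict[Tuple[str, int], str] = {
    ("banco", 1): "ili-0001",  # financial institution
    ("banco", 2): "ili-0002",  # seat
}

# inter-lingual-index record -> target-language word
ILI_TO_TARGET: Dict[str, str] = {
    "ili-0001": "bank",
    "ili-0002": "bench",
}


def generate_lexical_unit(word: str, sense: int) -> Optional[str]:
    """Map a disambiguated source word to a target word through the index."""
    ili = SOURCE_TO_ILI.get((word, sense))
    if ili is None:
        return None  # no index record for this word sense
    return ILI_TO_TARGET.get(ili)
```

The sense chosen in the WSD step determines the target word: the same source word yields different translations for different senses.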
Once the semantic roles have been identified, the interlingua
representation will store the clauses with their features, the
different entities that have appeared in the text and the
relations between them (such as anaphoric relations). This
representation will be the input for the generation module.
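A minimal sketch of such a representation is given below, assuming that each clause holds a feature dictionary and a mapping from semantic roles to entity identifiers. The class names, the role and feature labels, and the chain-following helper are hypothetical and do not reproduce the KANT-based notation used in AGIR.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Entity:
    ident: int
    head: str  # head word of the entity


@dataclass
class Clause:
    predicate: str
    features: Dict[str, str]  # e.g. tense (labels are assumptions)
    roles: Dict[str, int]     # semantic role -> entity identifier


@dataclass
class InterlinguaText:
    clauses: List[Clause] = field(default_factory=list)
    entities: Dict[int, Entity] = field(default_factory=dict)
    # (anaphor identifier, antecedent identifier)
    anaphoric_links: List[Tuple[int, int]] = field(default_factory=list)

    def coreference_chain(self, entity_id: int) -> List[int]:
        """Follow anaphoric links from an entity back to its first antecedent."""
        links = dict(self.anaphoric_links)
        chain = [entity_id]
        while chain[-1] in links and links[chain[-1]] not in chain:
            chain.append(links[chain[-1]])
        return chain


# "John bought a car. He drives it."
text = InterlinguaText(
    clauses=[
        Clause("buy", {"tense": "past"}, {"agent": 1, "theme": 2}),
        Clause("drive", {"tense": "present"}, {"agent": 3, "theme": 4}),
    ],
    entities={1: Entity(1, "John"), 2: Entity(2, "car"),
              3: Entity(3, "he"), 4: Entity(4, "it")},
    anaphoric_links=[(3, 1), (4, 2)],
)
```

Because entities and anaphoric links are stored for the whole text rather than per sentence, an intersentential pronoun such as "he" can be traced to its antecedent at generation time.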
Jesus Peral
2002-12-13