
AGIR's Analysis Module

The AGIR system architecture is based on the general architecture of an MT system that uses an interlingua strategy. Translation is carried out in two stages: (1) from the source language to the interlingua, and (2) from the interlingua into the target language. Modules for analysis are independent from modules for generation. Although our present work has studied only Spanish and English, the approach can be easily extended to other languages, that is, to a multilingual system, in the sense that any analysis module can be linked to any generation module.
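The independence of analysis and generation modules can be illustrated with a minimal sketch. It is not AGIR's implementation: the concept identifiers, the toy lexicons, and the names `make_analyzer`, `make_generator`, and `translate` are all hypothetical, and the "interlingua" here is just a list of language-neutral concept labels.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an interlingua: a list of concept labels.
Interlingua = List[str]

# Toy lexicons (assumed for illustration), mapping words to concept labels.
ES_LEXICON = {"gato": "CAT", "negro": "BLACK"}
EN_LEXICON = {"cat": "CAT", "black": "BLACK"}


def make_analyzer(lexicon: Dict[str, str]) -> Callable[[str], Interlingua]:
    """Build an analysis module: source text -> interlingua."""
    def analyze(text: str) -> Interlingua:
        return [lexicon[word] for word in text.lower().split()]
    return analyze


def make_generator(lexicon: Dict[str, str]) -> Callable[[Interlingua], str]:
    """Build a generation module: interlingua -> target text."""
    inverse = {concept: word for word, concept in lexicon.items()}
    def generate(interlingua: Interlingua) -> str:
        return " ".join(inverse[concept] for concept in interlingua)
    return generate


ANALYZERS = {"es": make_analyzer(ES_LEXICON), "en": make_analyzer(EN_LEXICON)}
GENERATORS = {"es": make_generator(ES_LEXICON), "en": make_generator(EN_LEXICON)}


def translate(text: str, source: str, target: str) -> str:
    """Stage 1: source -> interlingua. Stage 2: interlingua -> target."""
    return GENERATORS[target](ANALYZERS[source](text))
```

Because each analyzer is paired with a generator only at call time, adding a language means writing one analyzer and one generator rather than one transfer module per language pair. The sketch ignores word order and every other linguistic detail.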

In AGIR the analysis is carried out using SUPAR (slot unification parser for anaphora resolution) [Ferrández et al., 1999]. SUPAR is a computational system that focuses on anaphora resolution. It can deal with several kinds of anaphora, such as pronominal anaphora, one-anaphora, or definite descriptions⁴. SUPAR's input is a grammar defined by means of the grammatical formalism SUG (slot unification grammar). A translator that transforms SUG rules into Prolog clauses has been developed; it produces a Prolog program that parses each sentence. SUPAR can perform either full or partial parsing of the text with the same parser and grammar. In this study, partial-parsing techniques have been used because of the unavoidable incompleteness of the grammar and the use of unrestricted texts (corpora) as input.

The analysis of the source text is carried out in several steps. The first step is the lexical and morphological analysis of the input text. Because unrestricted texts are used as input, the system obtains the lexical and morphological information of the text's lexical units from the output of a part-of-speech (POS) tagger. For each lexical unit in the corpus, the tagger supplies the word as it appears in the corpus, its lemma, and its POS tag (with morphological information).
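The three pieces of information supplied for each lexical unit can be pictured as a small record. The sketch below assumes a tab-separated tagger output with one unit per line (word, lemma, tag); that file format, the tag values, and the names `LexicalUnit` and `read_tagged` are assumptions made for illustration, since the tagger's actual output format is not described here.

```python
from typing import List, NamedTuple


class LexicalUnit(NamedTuple):
    word: str   # surface form as it appears in the corpus
    lemma: str  # canonical form
    tag: str    # POS tag, carrying morphological information


def read_tagged(text: str) -> List[LexicalUnit]:
    """Parse assumed 'word<TAB>lemma<TAB>tag' lines into lexical units."""
    units = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, lemma, tag = line.split("\t")
        units.append(LexicalUnit(word, lemma, tag))
    return units


# Illustrative input; the tag names are examples only.
SAMPLE = "The\tthe\tDT\ncats\tcat\tNNS\nsleep\tsleep\tVBP\n"
UNITS = read_tagged(SAMPLE)
```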

The next step is the parsing of the text (which includes the lexical and morphological information extracted in the previous stage). Before parsing, the text is split into sentences. The output is the slot structure (SS), which stores the necessary information⁵ for the subsequent stages.

In the third step, a word-sense disambiguation (WSD) module is used to obtain a single sense for each lexical unit of the text. The lexical resources WordNet [Miller et al., 1990] and EuroWordNet [Vossen, 1998] have been used in this stage⁶.

The SS, enriched with the information from the previous steps, is the input for the next step, in which NLP problems (anaphora, extraposition, ellipsis, etc.) are treated and solved. In this work, we have focused on the resolution of NLP problems related to pronominal anaphora. After this step, a new slot structure (SS') is obtained. In this new structure, the correct antecedent for each anaphoric expression is stored along with its morphological and semantic information. The antecedent is chosen from the possible candidates by applying a method based on constraints and preferences [Ferrández et al., 1999]. The new structure SS' is the input for the final step of the analysis module.
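The general shape of a constraints-and-preferences method can be sketched as a filter followed by a ranking. The example below is not the method of [Ferrández et al., 1999]: the single constraint (gender and number agreement, antecedent not after the pronoun) and the single preference (closest sentence) are simplifying assumptions, and `Mention` and `resolve` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mention:
    text: str
    gender: str    # e.g. "masc", "fem"
    number: str    # e.g. "sg", "pl"
    sentence: int  # index of the sentence in which the mention occurs


def resolve(pronoun: Mention, candidates: List[Mention]) -> Optional[Mention]:
    """Pick an antecedent: constraints discard, preferences rank."""
    # Constraints: remove candidates that cannot be the antecedent.
    compatible = [
        c for c in candidates
        if c.gender == pronoun.gender
        and c.number == pronoun.number
        and c.sentence <= pronoun.sentence
    ]
    if not compatible:
        return None
    # Preferences: among the survivors, prefer the closest sentence.
    return min(compatible, key=lambda c: pronoun.sentence - c.sentence)


CANDIDATES = [
    Mention("Mary", "fem", "sg", 0),
    Mention("the boys", "masc", "pl", 1),
    Mention("Ann", "fem", "sg", 1),
]
SHE = Mention("she", "fem", "sg", 2)
```

Here `resolve(SHE, CANDIDATES)` discards "the boys" by agreement and then prefers "Ann" over "Mary" by distance. A realistic system would combine several weighted preferences rather than a single one.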

In the last step, AGIR generates the interlingua representation of the entire text. This is the main difference between AGIR and other MT systems, which process the input text sentence by sentence. The interlingua representation allows the correct translation of both intersentential and intrasentential pronominal anaphora into the target language. Moreover, AGIR allows the identification of the co-reference chains in the text and their subsequent translation into the target language.

The interlingua representation of the input text takes the clause as its main unit. Once the text has been split into clauses, AGIR uses a complex feature structure for each clause. This structure is composed of semantic roles and features extracted from the SS of the clause. The notation we have used is based on the one used in the KANT interlingua.

It is important to emphasize that the interlingua lexical unit is represented in AGIR by the word together with its correct sense in WordNet. By accessing the ILI (inter-lingual-index) module of EuroWordNet, we are able to generate the lexical unit in the target language.
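The role of the ILI as a pivot between wordnets can be pictured as two table lookups. The sketch below uses plain dictionaries with invented identifiers and sense numbers. It does not call any real WordNet or EuroWordNet interface, and the names `SOURCE_TO_ILI`, `ILI_TO_TARGET`, and `target_lemma` are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Toy data (assumed): (lemma, sense number) in the source wordnet -> ILI id.
SOURCE_TO_ILI: Dict[Tuple[str, int], str] = {
    ("banco", 1): "ili-0001",
    ("banco", 2): "ili-0002",
}

# Toy data (assumed): ILI id -> lemma in the target wordnet.
ILI_TO_TARGET: Dict[str, str] = {
    "ili-0001": "bank",
    "ili-0002": "bench",
}


def target_lemma(lemma: str, sense: int) -> Optional[str]:
    """Map a disambiguated source word to a target lemma via the ILI."""
    ili = SOURCE_TO_ILI.get((lemma, sense))
    if ili is None:
        return None  # the source sense has no ILI record in the toy data
    return ILI_TO_TARGET.get(ili)
```

The lookup only works once the sense is fixed, which is why word-sense disambiguation precedes this step: the same lemma with different senses maps to different target words.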

Once the semantic roles have been identified, the interlingua representation will store the clauses with their features, the different entities that have appeared in the text and the relations between them (such as anaphoric relations). This representation will be the input for the generation module.
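A text-level representation of this kind could be modelled as follows. This is a sketch under stated assumptions, not the KANT-based notation used in AGIR: the class names, field names, role labels, and sense numbers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Entity:
    id: int
    lemma: str
    sense: int  # sense number of the lemma (illustrative values)


@dataclass
class Clause:
    predicate: str
    roles: Dict[str, int]  # semantic role name -> entity id
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class InterlinguaText:
    clauses: List[Clause] = field(default_factory=list)
    entities: Dict[int, Entity] = field(default_factory=dict)
    # Each link is (anaphor entity id, antecedent entity id).
    anaphoric_links: List[Tuple[int, int]] = field(default_factory=list)

    def antecedent_of(self, entity_id: int) -> Optional[Entity]:
        """Return the antecedent linked to an anaphor, if any."""
        for anaphor, antecedent in self.anaphoric_links:
            if anaphor == entity_id:
                return self.entities[antecedent]
        return None


# "The cat slept. It purred." as two clauses with one anaphoric relation.
TEXT = InterlinguaText(
    clauses=[
        Clause("sleep", {"agent": 1}, {"tense": "past"}),
        Clause("purr", {"agent": 2}, {"tense": "past"}),
    ],
    entities={1: Entity(1, "cat", 1), 2: Entity(2, "it", 0)},
    anaphoric_links=[(2, 1)],
)
```

Because entities and relations are stored for the whole text rather than per sentence, the antecedent of a pronoun stays reachable even when it occurs in an earlier sentence.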


Jesus Peral 2002-12-13