Active Learning

As indicated in the previous sections, we have built an integrated system for language acquisition that is flexible and useful. However, a major difficulty remains: the construction of training corpora. Though annotating sentences is still arguably less work than building an entire system by hand, the annotation task is time-consuming and error-prone. Further, the training pairs often contain redundant information. We would like to minimize the amount of annotation required while still maintaining good generalization accuracy. To do this, we turned to methods in active learning.

Active learning is a research area in machine learning in which systems automatically select the most informative examples for annotation and training [Angluin, 1988; Seung et al., 1992], rather than relying on a benevolent teacher or random sampling. The primary goal of active learning is to reduce the number of examples that the system is trained on, while maintaining the accuracy of the acquired information. Active learning systems may construct their own examples, request certain types of examples, or determine which of a set of unlabeled examples would be most useful to label. The last approach, selective sampling [Cohn et al., 1994], is particularly attractive in natural language learning, since there is an abundance of text and we would like to annotate only the most informative sentences. For many language learning tasks, annotation is particularly time-consuming because it requires specifying a complex output rather than just a category label, so reducing the number of training examples required can greatly increase the utility of learning.

In this section, we explore the use of active learning, specifically selective sampling, for lexicon acquisition, and demonstrate that with active learning, fewer examples are required to achieve the same accuracy obtained by training on randomly chosen examples.

The basic algorithm for selective sampling is relatively simple. Learning begins with a small pool of annotated examples and a large pool of unannotated examples, and the learner attempts to choose the most informative additional examples for annotation. Existing work in the area has emphasized two approaches: certainty-based methods [Lewis & Catlett, 1994] and committee-based methods [McCallum & Nigam, 1998; Freund et al., 1997; Liere & Tadepalli, 1997; Dagan & Engelson, 1995; Cohn et al., 1994]; we focus here on the former. In the certainty-based paradigm, a system is first trained on a small number of annotated examples to learn an initial classifier. Next, the system examines the unannotated examples and attaches certainties to their predicted annotations. The k examples with the lowest certainties are then presented to the user for annotation and retraining. Many methods for attaching certainties have been used, but they typically attempt to estimate the probability that a classifier consistent with the prior training data will classify a new example correctly.
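To make this loop concrete, the following sketch restates certainty-based selective sampling in Python. The functions train, certainty, and annotate are hypothetical placeholders for a learner, its confidence estimate on an unlabeled example, and a human annotation step; they are assumptions of this sketch rather than part of WOLFIE or CHILL.

    def selective_sampling(train, certainty, annotate, annotated, unannotated,
                           k, rounds):
        """Repeatedly train, score the unannotated pool, and request labels
        for the k examples the current model is least certain about."""
        for _ in range(rounds):
            model = train(annotated)                      # learn from all labels so far
            ranked = sorted(unannotated, key=lambda ex: certainty(model, ex))
            batch, unannotated = ranked[:k], ranked[k:]   # k least-certain examples
            annotated = annotated + annotate(batch)       # human supplies annotations
        return train(annotated)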
  
[Figure 20: Selective Sampling Algorithm]

Figure 20 presents abstract pseudocode for certainty-based selective sampling. Ideally, the batch size, k, would be set to one, so that each selection benefits from all previous annotations, but it is frequently set higher for efficiency when retraining batch learning algorithms. Results on a number of classification tasks have demonstrated that this general approach is effective in reducing the need for labeled examples (see the citations above).

Applying certainty-based sample selection to WOLFIE requires determining the certainty of a complete annotation of a potential new training example, even though individual learned lexical entries and parsing operators perform only part of the overall annotation task. Our general approach is therefore to compute certainties for pieces of an example, in our case phrases, and combine these to obtain an overall certainty for the example. Since lexicon entries contain no explicit uncertainty parameters, we used WOLFIE's heuristic measure to estimate uncertainty.

To choose the sentences to be annotated in each round, we first bootstrapped an initial lexicon from a small corpus, keeping track of the heuristic values of the learned items. Then, for each unannotated sentence, we took the average of the heuristic values of the lexicon entries learned for the phrases in that sentence, assigning a value of zero to unknown words but eliminating from consideration any words that we assume are known in advance, such as database constants. Thus, a long sentence with only a few known phrases receives a lower certainty than a shorter sentence with the same number of known phrases; this is desirable, since longer sentences are more informative from a lexicon learning point of view. The sentences with the lowest values were chosen for annotation, added to the bootstrap corpus, and a new lexicon was learned. Our technique is summarized in Figure 21.
  
[Figure 21: Active Learning for WOLFIE]
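As a rough illustration of the scoring just described, the sketch below computes a certainty for each candidate sentence and selects the k lowest-scoring ones. The names lexicon (a mapping from phrase to the heuristic value recorded when its entry was learned) and known_in_advance (phrases such as database constants whose meanings are given to the system) are assumptions of this sketch, and phrases are simplified to single words, whereas WOLFIE itself handles multi-word phrases.

    def sentence_certainty(sentence, lexicon, known_in_advance):
        """Average the heuristic values of learned lexicon entries over the
        phrases in a sentence, counting unknown phrases as zero and skipping
        phrases whose meanings are assumed known in advance."""
        phrases = [w for w in sentence.split() if w not in known_in_advance]
        if not phrases:
            return float("inf")        # nothing left to learn: maximally certain
        return sum(lexicon.get(p, 0.0) for p in phrases) / len(phrases)

    def select_for_annotation(sentences, lexicon, known_in_advance, k):
        """Return the k sentences the current lexicon is least certain about."""
        ranked = sorted(sentences,
                        key=lambda s: sentence_certainty(s, lexicon, known_in_advance))
        return ranked[:k]

Because unknown phrases contribute zeros to the average, a long sentence with few known phrases scores lower than a shorter sentence with the same number of known phrases, matching the behavior described above.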

To evaluate our technique, we compared active learning to learning from randomly selected examples, again measuring the effectiveness of the learned lexicons as background knowledge for CHILL. As in the original WOLFIE tests, we used the (smaller) U.S. Geography corpus, with the learned lexicons serving as background knowledge during parser acquisition (and the same examples used for parser acquisition). For each trial, we first randomly divided the data into a training set and a test set. Then, n=25 bootstrap examples were randomly selected from the training examples, and in each step of active learning, the k=10 least certain of the remaining training examples were selected and added to the training set. The result of learning on this set was evaluated after each step. The accuracy of the resulting parsers was compared to the accuracy of parsers learned using randomly chosen examples for both lexicon and parser learning, as in Section 5; in other words, the baseline can be viewed as choosing the k examples in each round at random. Figure 22 shows the accuracy on unseen data of parsers learned using the lexicons acquired by WOLFIE when examples are chosen randomly and actively.
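For concreteness, one trial of this protocol might be sketched as follows; the helpers learn_lexicon_and_parser, parser_accuracy, and certainty are hypothetical stand-ins for WOLFIE and CHILL training, parser evaluation, and the sentence scoring sketched earlier, and the values n=25 and k=10 match those used in the experiments.

    import random

    def run_trial(train_set, test_set, certainty, learn_lexicon_and_parser,
                  parser_accuracy, n=25, k=10):
        """One trial: bootstrap from n random training examples, then repeatedly
        annotate the k least certain remaining examples and re-learn, recording
        parser accuracy after every round."""
        random.shuffle(train_set)
        chosen, remaining = train_set[:n], train_set[n:]
        accuracies = []
        while True:
            lexicon, parser = learn_lexicon_and_parser(chosen)    # WOLFIE, then CHILL
            accuracies.append(parser_accuracy(parser, test_set))  # evaluate this round
            if not remaining:
                break
            remaining.sort(key=lambda s: certainty(s, lexicon))   # least certain first
            chosen, remaining = chosen + remaining[:k], remaining[k:]
        return accuracies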
  
[Figure 22: Using Lexicon Certainty for Active Learning]

Active learning yields an annotation savings of around 50 examples: the maximum accuracy is reached after 175 examples, versus 225 with randomly chosen examples. The advantage of active learning is apparent from the start, although the difference between the two curves is statistically significant only at 175 training examples. Since we are learning both lexicons and parsers but choosing examples based only on WOLFIE's certainty measures, the savings could likely be increased further if CHILL also had a say in which examples were chosen. See Thompson (1999) for a description of active learning for CHILL.