BE

1. Motivation

It has long been a goal of the summarization community to find automatic methods of summary evaluation that produce reliable and stable scores. All automated methods today work by comparing the system summary to one or more reference summaries (ideally, produced by humans). But even evaluating summary content alone, setting style aside, has been problematic. Experience has shown that measuring summary content at the sentence level is not precise enough: sentences generally contain too many bits of information, some of which may be important to include in a summary while others may not be.

There have been two classes of response to this problem: the word-sized and the chunk-sized. In ROUGE (Lin at USC/ISI) and similar systems, the approach is to measure the overlap of each word (or small n-gram) with the reference summaries. The problem here is twofold: multi-word units (such as "United States of America") are not treated as single items, which skews the scoring, and relatively unimportant words (such as "from") count the same as relatively more important ones. Simple efforts to circumvent these problems remain unsatisfactory and crude. Nonetheless, this approach can be automated and can produce evaluation rankings that correlate reasonably with human rankings, as demonstrated in the ROUGE publications.
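
The word-level overlap idea can be illustrated with a minimal sketch. This is not the ROUGE implementation; it is a simplified n-gram recall under assumed conventions (lowercasing, whitespace tokenization, clipped counts), and the function names and example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also occur in the candidate.

    Assumption: text is lowercased and split on whitespace, and each
    reference n-gram is matched at most as often as it occurs in the
    candidate (clipped counts).
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matched / total


# Hypothetical example sentences, for illustration only.
reference = "the delegation from the United States of America arrived"
candidate = "a delegation from America arrived"
print(ngram_recall(candidate, reference, n=1))
```

In this example 4 of the 9 reference words are matched. The match on "from" contributes exactly as much as the match on "America", and "United States of America" is scored as four unrelated words, which are the two weaknesses described above.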

The other response is to extract longer chunks, namely the strings of contiguous words that express valuable material, from one or more of the reference summaries, and to treat these chunks as the ideal content. Each chunk, regardless of length, is treated as a semantic unit, that is, a unit that expresses one core notion. Each unit is assigned an importance rating depending on how many reference summaries contain it. In recent research, Van Halteren and Teufel (factoid: WAS 2003, EMNLP 2004) in Europe and Nenkova and Passonneau (SCU: HLT/NAACL 2004) at Columbia University in New York have independently investigated this type of approach. Since an element that is included in many reference summaries is presumably more important than one that is included in only a few, this method provides a natural way of scoring each element. The latter two researchers create a 'pyramid' of elements, with the most frequently included ones at the top, the next most frequent one layer down, and so on. Evaluating a new summary then becomes a process of comparing its contents to the elements in the pyramid and adding the appropriate score for each one matched. A higher score means the new summary overlapped with more of the reference summary contents and is hence assumed to be a better summary. Preliminary studies show this approach to correlate well with human intuition. The trouble is that creating these chunks is difficult to automate, since they can be of arbitrary size and must incorporate quite different ways of saying the same thing (reference summaries typically say the same thing, or parts of the same thing, in different ways).
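
The scoring step of the chunk-based approach can be sketched as follows. This is only an illustration under strong assumptions: the units are taken to be already identified and matched by hand (the step that is difficult to automate), each unit is represented by a hypothetical label, and the score is a raw sum of weights with no normalization. It is not the published factoid or pyramid procedure.

```python
from collections import Counter


def build_pyramid(reference_units):
    """Weight each unit by the number of reference summaries containing it."""
    weights = Counter()
    for units in reference_units:
        weights.update(set(units))  # count a unit once per reference summary
    return weights


def pyramid_score(summary_units, weights):
    """Sum the weights of the pyramid units matched by a new summary."""
    return sum(weights[unit] for unit in set(summary_units) if unit in weights)


# Hypothetical, hand-labeled units for three reference summaries.
references = [
    ["quake_hits_city", "hundreds_injured", "aid_sent"],
    ["quake_hits_city", "hundreds_injured"],
    ["quake_hits_city", "power_cut"],
]
weights = build_pyramid(references)

# Units found in a new summary; "mayor_speaks" is in no reference summary.
new_summary = ["quake_hits_city", "aid_sent", "mayor_speaks"]
print(pyramid_score(new_summary, weights))
```

Here "quake_hits_city" sits at the top of the pyramid with weight 3, "hundreds_injured" has weight 2, and the remaining units have weight 1, so the new summary scores 3 + 1 = 4. The unmatched unit adds nothing.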


Last modified: Tue Apr 12 18:17:56 PDT 2005