July 21, 2004
Bleu: a Method for Automatic Evaluation of Machine Translation
In a previous article I discussed some challenges associated with trying to grade two sequences of activities against each other: one is the true, or gold, trace, and the other is the inferred, or black, trace. In that discussion I suggested that the desiderata for a metric that scores the black trace against the gold trace (so that you can evaluate the quality of different black traces) were:
- It is not biased toward long activities.
- It measures an inference engine's ability to discriminate among different activities.
- It penalizes rapid changes in instantaneous prediction.
As a result of that discussion Henry referred me to this paper.
This paper was written to describe a method of scoring different machine translations. Instead of a black trace, they looked at a machine translation of a given sentence. Instead of a gold trace, they looked at an expert human translation of the same sentence. Whereas in our application we have activity steps, in their application they have words. Basically it's a straightforward mapping of our problem to theirs.
Their solution was to use a metric that matches successively longer n-grams in the black trace against n-grams in the gold trace, with a penalty for black traces that are much shorter than the gold trace. The results of matching n-grams of different lengths are averaged after being scaled for the expected random matching. In the results of their paper they showed that their metric discriminated well among human-generated translations as well as among subtly different machine translations.
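To make the scoring idea concrete, here is a minimal Python sketch of a BLEU-style score against a single reference: clipped n-gram precisions for n = 1..4, combined with a geometric mean and scaled by a brevity penalty for short candidates. This is my simplification, not the exact formulation from the paper (which averages over multiple references and corpus-level counts), and the function names are mine.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU: geometric mean of clipped n-gram
    precisions, times a brevity penalty for short candidates."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each candidate n-gram count by its count in the reference,
        # so repeating a matching n-gram cannot inflate the score.
        matches = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if matches == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: candidates shorter than the reference are discounted.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```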
Because it is based on a tokenized set of activities, it is not biased toward long activities. It scores a black trace based on sequential orderings (n-gram matches), which means it exercises an inference engine's ability to discriminate among activities, and in particular multi-step activities. It also effectively penalizes an algorithm that rapidly switches among activity predictions, because that causes the n-gram matches to degrade quickly. So it meets our desiderata.
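Here is a toy comparison in the activity setting (the activity labels are hypothetical), using the sketch above, of a fairly stable black trace and a rapidly flickering one against the same gold trace:

```python
gold    = ["coffee", "coffee", "coffee", "toast", "toast", "toast"]
steady  = ["coffee", "coffee", "toast", "toast", "toast", "toast"]
flicker = ["coffee", "toast", "coffee", "toast", "coffee", "toast"]

print(bleu(steady, gold))   # ~0.76: long runs keep most n-gram matches intact
print(bleu(flicker, gold))  # 0.0: rapid switching breaks every trigram match
```

Even though the flickering trace gets every unigram right, its higher-order n-grams almost never appear in the gold trace, so its score collapses.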
It seems like it would be a good method, but it needs to be slightly modified to make sure that all tokenized matches overlap at some point in time.
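One way that constraint might look, purely as a guess at the modification (the timed-token representation and helper names are assumptions, not anything from the paper), is to carry a (label, start, end) triple for each activity token and only count an n-gram match when each pair of matched tokens also overlaps in time:

```python
def intervals_overlap(a, b):
    """True if two (start, end) time intervals share at least one instant."""
    return a[0] <= b[1] and b[0] <= a[1]

def timed_ngram_match(cand_gram, ref_gram):
    """Match two n-grams of (label, start, end) tokens only if the labels
    agree position by position and each token pair overlaps in time."""
    return all(c[0] == r[0] and intervals_overlap(c[1:], r[1:])
               for c, r in zip(cand_gram, ref_gram))
```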
Posted by djp3 at July 21, 2004 5:43 PM