Codex Caelestis: Evaluating Activity Inference

July 08, 2004

Evaluating Activity Inference

Here is a small interesting problem.

A user in a smart home performs a sequence of activities and the smart home records a sensor stream that was generated by those activities. The sensors might be motion sensors or RFID tags or infrared beam breaks or whatever. What the user really did is recorded as a sequence of time stamped activities. For discussion we'll call the true description of what happened and when it happened the gold trace.

Now let's suppose there is a hypothetical piece of software which is trying to recreate what happened based on the sensor stream. The inference mechanism might be an HMM or a stochastic solution to a DBN - how you infer the user's actions is not material to the issue I am describing. However, the software must produce a sequence of time stamped activities comparably to the gold trace. Let's call the software inferred trace the black trace.

The problem is this: If you have multiple pieces of software that infer the activities in different ways, you would like to be able to assign a score of how close the black trace is to the gold trace so that you can determine which software did the best job. What is the best way to do this scoring?

One way (method A) is to take both the gold trace and the black trace and slice them into N bins of equal amounts of time. Then for each bin, determine if the black trace and the gold trace agree. The final score is the percentage of bins in which the black trace matches the gold trace.

This method is definitely biased by long activities. An activity which takes a long amount of time is going to be disproportionately represented in the scoring function. In particular you could have a very easy activity to detect (say sleeping) which takes 12 hours. Your activity inferencer might correctly figure out when you are sleeping, but that's all it can do. Using this method, your lame inference engine would get 50% accuracy. This seems to be a bad metric on one hand because it doesn't test an inferencing engine's ability to detect activities. On the other hand, from the perspective of a user of this system it does make sense. Every time they look at the output of the inference engine they know that the engine will be right 50% of the time.

Another method (method B) is to forget the timing information and just translate the gold trace into a sequence of activities. If each activity is represented by a character then the gold trace becomes a string. Now do the same thing with the black trace. Given these two strings you can find a minimum cost edit distance between the two strings. Using the edit mapping you can determine false positives and false negatives and use those metrics to score your traces.

This method is more satisfying from a machine learning perspective because it exercises the inference engine's ability to discriminate among activities. In contrast to method A though it might be entirely unsatisfying to a user. In order to see the degenerate case, imagine that an inference engine correctly guesses the user's current activity for 1 seconds in the middle of the activity and picks a default activity for the rest of the time. The edit distance will be relatively low because there will be one extra activity (the default one) inserted in the black trace for every true activity, but by clock time it will be right a miniscule amount of time.

A third method proposed by Tanzeem is to segment the black trace according to when the true activities (in the gold trace) occurred. For each of these chunks of the black trace integrate over the likelihood of each of the possible activities to determine the time and confidence weighted vote that the inference engine casts for each activity over the length of the activity. If the highest vote is for the correct activity then the black trace is correct, otherwise not.

This method is satisfying because it doesn't bias longer activities against shorter activities, but it does allow the inference engine to rapidly switch it's current best guess while an activity is under way. (Of course you can only switch the instantaneous best guess so often and still achieve overall accuracy, but it is still possible). This behavior would be completely unsatisfying to a user and would be penalized by methods B and A.

So in conclusion and in light of these techniques the desiderata for a good scoring scheme are:

It is not biased toward long activities.
It measures an inference engine's ability to discriminate among different activities.
It penalizes rapid changes in instantaneous prediction.

Posted by djp3 at July 8, 2004 12:23 PM | TrackBack (0)

Comments