Codex Caelestis: May 16, 2004 - May 22, 2004 Archives

May 22, 2004

Continuous Time Bayesian Networks

This paper describes a new way of modeling the changes in a state space over time called Continuous Time Bayesian Networks (CTBNs). It uses a novel graphical model framework which competes with a Dynamic Bayesian Network and is specifically trying to move away from methods which require high frequency discrete updates to "simulate" continuous time. It also attempts to address the correlation problem which Koller describes as the entanglement problem, an effect in DBNs which causes independent random variables to have a dependency due to frequent updates.

At the core of this technique are homogenous Markov processes (HMP) which look like a particular interpretation of Markov Models: The transitions in a Markov process correspond to temporal changes. The transition probabilities do not change over time and are therefore "homogenous". Every state has a self-transition and therefore the distribution over staying in a state is exponential. Instead of a conditional probability table, there is a "conditional intensity matrix" (CIM) which describes the exponential transitions between states.

A CTBN factors the state space much like a Bayesian influence diagram. The image above shows the graphical model which describes the independencies in a particular domain related to drug response. In the CTBN paradigm, cycles are allowed. This is due to the fact that all influence is considered to occur over time however infinitesimal that amount of time may be. So, for example, in the diagram on the left, the cycle at the top of the diagram would be interpreted from a DBN perspective as Eating_N -> FullStomach_N+dt , FullStomach_N -> Hungry_N+dt , and Hungry_N -> Eating_N+dt. Because the influences are factored, each influence arrow A->B has a different CIM for each joint setting of the parents of A.

One of the best motivations for creating this paradigm is because in one calculation it is possible to exactly infer the distribution over the state of the HMP without higher frequency updates. So given a distribution over the state at time T, the distribution over the state at time T+s can be calculated directly in one step regardless of the value of s. This answers one of my questions about how reasoning could be done more efficiently than a DBN which I raised in my review of A Closed-Form Solution for Mapping General Distributions to Minimal PH Distributions

Another question I had about this technique was: What happens if you query the state of the system at a very large time in the future? The authors confirmed my suspicion that the distribution always tends toward the "prior" which in this case is called the "stationary distribution."

The first approximate inference algorithm that was proposed for CTBNs, the "Linearization Method", basically assumed that the exponential distributions were linear and that the values of the random variables didn't change between updates. This is funny because it basically means that this inference method is only valid over short time intervals, exactly the problem that CTBNs were supposed to improve over DBNs! The second inference algorithm, the "Sub-System Method", seemed inconsistent. Sometimes it worked well in the short-term and sometimes it worked better in the long-term. All of the approximate methods did better with frequent updates, whether artificially induced or due to the presence of observations.

So my conclusion is that this is a step in the right direction because it explicitly models continuous time. I like the elegance of the method and the possibility of modeling more complicated distributions than exponentials through combinations of exponentials. I think it's premature though to say that it is clearly a better approach than DBNs. It certainly appeals to a mathematical sense of being the "right" way to reason about time in a Bayesian model, but the experiments don't seem to conclude that CTBNs have fixed the problems that DBNs have.

Posted by djp3 at 6:30 AM | Comments (1) | TrackBack (0)

May 21, 2004

WWW 2004 Day 3

Keynote address by Mozelle Thopson, FTC: Spam is bad, Spyware is bad, Porn is bad, I'm busy fixing it all with the help of industry partners.
Keynote address by James J. Duderstadt, University of Michigan: The Internet is a disruptive technology. It is changing fast. It's affecting Universities. I'm at the forefront of getting universities to "get it."
The most interesting talk that I've seen this entire conference is a discussion of how memes propogate through blog-space. No big surprises: there are a few very influential individuals and lots of moderately influential people. There were a few off-hand comments made about how to leverage these people for various reasons (e.g., marketing) but I think they were poorly thought out. I think the reason why some people are influential is because they are insightful, they have a monopoly on a information channel, or they provide a service. The second person is the one who can be used to market a product, but the first and last are valuable because they will not market a product that they don't believe in. A better idea is to get the first and last to review your product and listen to what they say before marketing something.
More on target for this summer's research was an interesting talk on clustering web pages based on tree-edit distances. Since web news sites ( and many others ) are just a template that is filled out with database information it is possible to find the portions of the HTML tree which contain the interesting information. The interesting idea was that there is an algorithm which will figure on the tree-edit distance of two trees. So two trees can be generalized into a single tree with various wildcard nodes. This suggests a method of merging two RFID traces from the anesthesiologists in an interesting way. The presence of a tree might map well to a hierarchical representation of the task. The tree-merge / tree-edit technique might do a good job of merging different expert traces. One of the questions he received was: Why didn't he use XPath to represent the general tree? Good question. Maybe there are XPath algorithms that will generalize two trees.

Posted by djp3 at 6:50 AM | Comments (2) | TrackBack (0)

May 20, 2004

WWW 2004 Day 2

Ud Manber Keynote Address about Amazon's A9 search engine
- Amazon wants to invent new things in search focussed on e-commerce, that are easy to use, relevant and empirically verifiable as better (particularly as quality is concerned). A9 is a wholly owned subsidiary which is set up to do this.
- "Search inside a book" is one of the first apps. Hundreds of thousands of books scanned then OCR'd in about six months.
- "Search history" is another feature. This also includes "Click History" and a display of what new sites have been found for a given query since the last time you made that search query. Lots of privacy implications here, so Amazon has an unpersonalized site for the people concerned with privacy that still want to use the search engine.

Posted by djp3 at 6:32 AM | Comments (0) | TrackBack (0)

May 19, 2004

WWW 2004 Day 1

Today is "WWW Day" according to Mayor Bloomberg
Tim Berners-Lee Keynote Address:
- The new top-level domain names are a bad idea because they aren't a fundamentally new idea. They are supposedly trying to increase the number of short domain names, but in fact someone who owns www.example.com is more or less forced to buy www.example.mobi to maintain trademark control. So at the end of the day it looked like it was just a cash cow for the domain name registers.
- He used the acronym FOAF: Friend of a Friend. I hadn't heard of that acronym before.
- Good point: The semantic web will take off as people write thin skins that publish things in RDF format as well as current formats like HTML.
- A question was posed about how the semantic web could take off because it commoditizes a lot of what web-sites have done to differentiate themselves and it breaks the advertising business model. Tim's response was that the semantic web is a new thing and therefore old business models won't necessarily work. He also made the point that in a lot of cases, communication benefits both parties and in these situations publishing semantic web data will work because both the provider and the receiver benefit. He cautioned against data providers who corrupt the data with advertising such as data integrators that only provide data on restaurants that have paid to be included. This seems like it won't be a problem as long as it's clear that people are paying to be included in the stream. Just like search engines that allow paid placements. Sometimes you want that, sometimes you don't, but you never want to not know which you are using.
Oren referenced a citation for the Google score that we use in GUIDE to figure out the probability of an object in an activity. He called it point-wise mutual information and said it was introduced by "Turney '01". I'll email him for more details.
"Haystack" is the name of a semantic web browser that looks pretty cool. It uses semantic information to control the views of information. It looks like it might be something big in an early stage.

Posted by djp3 at 6:57 AM | Comments (2) | TrackBack (0)

May 17, 2004

The Panda Cam

I took a trip to the National Zoo in Washington D.C. on Saturday and ran across an interesting approach to privacy in an area with sensors. In this case it was the panda exhibit at the National Zoo which has a panda cam which enables viewers on the Internet to see the pandas in real-time. You can see it here. In addition to the images, they also record sound.

Since the pandas are very close to the public when they are in their enclosure, visitor's images and voices are easily captured by the cameras and broadcast to the world. The zoo decided that the appropriate way to deal with this situation is to allow everyone to be recorded, and to inform them that they are under survellience.

On the one hand this is better than you usually get with survellience cameras, but on the other hand it's not at all clear where the cameras are in the exhibit, where the microphones are, or what the recording quality is. When I checked the website from home I was more bothered by the audio than the video since the video was primarily focussed on the pandas, but the audio picked up whatever.

I think that they should get rid of the audio and make it much more clear where the camera is located.

Posted by djp3 at 7:19 AM | Comments (0) | TrackBack (0)