# Bayesian model for hierarchical evaluation

hemanthk
Hi

I have a problem in which I have to search a huge library in order to find documents similar to a given query document.

The library is organised into clusters and each cluster contains documents of a particular class.

Given a query document, first we retrieve the most similar clusters and then find the most similar documents among the retrieved clusters.

Now my question is: how do we model this hierarchical evaluation in a Bayesian framework? (Assume we already have the retrieval methods in place.) We just need a framework to represent this hierarchical search probabilistically.

I just need a starting point or an example (a research paper, textbook, etc)...

Any help would be greatly appreciated. Thank you

how do we model this hierarchical evaluation in a Bayesian framework? (Assuming we already have the retrieval methods in place) We just need a framework to probabilistically represent this hierarchical search.

When I think of modeling such a situation, I think of writing a simulation for it. Writing simulations can involve computing conditional probabilities by Bayes' rule, but I wouldn't call that a "Bayesian framework". I don't know what it would mean to have a Bayesian framework for a simulation.

I think of a Bayesian framework as an approach to problems of statistical estimation or inference.

In the field of artificial intelligence, there are types of learning algorithms that are called Bayesian.

If you already have your procedures in place and you want to simulate them at work, I think you are just asking about how to create a simulation.

If you are trying to optimize your procedures by using a statistical decision method then it makes sense to ask about a Bayesian approach.

hemanthk
Hi Stephen,

Thanks for your reply. Actually, it's not a simulation that I want to do.
However, as you said, I would like to perform my two-level similarity evaluation and express the result as a posterior (a conditional probability).

I'll explain briefly.

Let's say the query document is D_q, and there is a library of documents organised into k clusters (c_1 ... c_k). Let the total number of documents in the library (across all clusters) be N.

Now, we have to find a document in the library (k clusters) most similar to the query D_q.
We do it in two phases:
Phase I: Perform a similarity analysis of D_q with each of the k clusters and find some number of most similar clusters. Let us denote the set of these clusters as C_s.
(Now the search space is narrowed down to C_s, a small part of the huge library.)
Phase II: Perform a similarity analysis of D_q with each document of each cluster in the set C_s.
Finally, we have a list of N similarities corresponding to all the N documents in the library.
The non-zero similarity values are the ones corresponding to the documents of C_s.
And all the other documents are given zero similarity since they did not even qualify in the first level of analysis.
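
The two phases above can be sketched in code roughly as follows. This is only a minimal illustration under stated assumptions, not the retrieval method actually in use: documents are assumed to be feature vectors, cosine similarity is assumed as the similarity measure, each cluster is assumed to be scored by its centroid, and the names `cosine`, `two_phase_search` and `top_m` are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def two_phase_search(query, docs, labels, top_m):
    """Return (sims, c_s): a length-N similarity array and the kept clusters.

    query  : (d,) feature vector for D_q
    docs   : (N, d) feature vectors of the library documents
    labels : (N,) cluster id of each document
    top_m  : number of clusters kept after Phase I (the size of C_s)
    """
    clusters = np.unique(labels)
    # Phase I: score every cluster (assumption: compare D_q with the centroid).
    cluster_scores = np.array(
        [cosine(query, docs[labels == c].mean(axis=0)) for c in clusters]
    )
    c_s = clusters[np.argsort(cluster_scores)[::-1][:top_m]]
    # Phase II: score only the documents that belong to a cluster in C_s.
    sims = np.zeros(len(docs))  # documents outside C_s keep similarity 0
    for i in np.flatnonzero(np.isin(labels, c_s)):
        sims[i] = cosine(query, docs[i])
    return sims, c_s
```

Calling `two_phase_search(query, docs, labels, top_m=2)` would return the N similarities described above, with zeros for every document whose cluster was not selected in Phase I.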

This is the way we are doing stuff right now.
Now what we want is to represent this evaluation as a similarity posterior over all the documents in the library given the clusters forming the library:

For i = 1 to N:
p(D_q ~ i | c_1, c_2, ..., c_k) = ...

The crux of my problem is this: how do I decompose this posterior probability expression so that it corresponds to the procedure explained above? I thought this might be done using Bayes' rule, but I am not sure how.

I hope I have made myself clear; if not, I am happy to explain further.

Thank you
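
One possible starting point, offered as a sketch rather than a settled answer, is to treat the cluster as a latent variable and apply the law of total probability. The notation is an assumption introduced here for illustration: d_i is the i-th library document, s(·,·) is a non-negative similarity score from the existing retrieval methods, and c(i) is the single cluster that contains d_i. Normalising similarity scores into probabilities is also an assumption, not something established in the thread.

```latex
% Marginalise over clusters (law of total probability):
p(d_i \mid D_q) \;=\; \sum_{j=1}^{k} p(d_i \mid c_j, D_q)\, p(c_j \mid D_q)

% Each document belongs to exactly one cluster c(i), so only one term survives:
p(d_i \mid D_q) \;=\; p(d_i \mid c_{(i)}, D_q)\, p(c_{(i)} \mid D_q)

% Phase I (cluster level), truncated to the retrieved set C_s:
p(c_j \mid D_q) \;=\;
\begin{cases}
  \dfrac{s(D_q, c_j)}{\sum_{c_l \in C_s} s(D_q, c_l)} & c_j \in C_s \\[2ex]
  0 & \text{otherwise}
\end{cases}

% Phase II (document level), normalised within each cluster:
p(d_i \mid c_j, D_q) \;=\; \frac{s(D_q, d_i)}{\sum_{d_m \in c_j} s(D_q, d_m)}
\quad \text{for } d_i \in c_j
```

Under these assumptions the values sum to one over all N documents, and every document outside C_s receives probability zero, which matches the zero similarities assigned in the two-phase procedure.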