Bayesian model for hierarchical evaluation

  1. Aug 23, 2012 #1

    I have a problem in which I have to search a huge library in order to find documents similar to that of a given query document.

    The library is organised into clusters and each cluster contains documents of a particular class.

    Given a query document, first we retrieve the most similar clusters and then find the most similar documents among the retrieved clusters.

    Now my question is: how do we model this hierarchical evaluation in a Bayesian framework? Assume the retrieval methods are already in place; we just need a framework to represent this hierarchical search probabilistically.

    I just need a starting point or an example (a research paper, textbook, etc.).

    Any help would be greatly appreciated. Thank you
  3. Aug 24, 2012 #2

    Stephen Tashi

    Science Advisor

    When I think of modeling such a situation, I think of writing a simulation for it. Writing simulations can involve computing conditional probabilities by Bayes' rule, but I wouldn't call that a "Bayesian framework". I don't know what it would mean to have a Bayesian framework for a simulation.

    I think of a Bayesian framework as an approach to problems of statistical estimation or inference.

    In the field of artificial intelligence, there are types of learning algorithms that are called Bayesian.

    If you already have your procedures in place and you want to simulate them at work, I think you are just asking about how to create a simulation.

    If you are trying to optimize your procedures by using a statistical decision method then it makes sense to ask about a Bayesian approach.

    I think you should clarify which situation you're asking about.
  4. Aug 24, 2012 #3
    Hi Stephen,

    Thanks for your reply. Actually, it's not a simulation that I want to do.
    However, as you said, I would like to perform my two-level similarity evaluation and denote the result as a posterior (conditional probability representation).

    I'll explain briefly.

    Let's say the query document is D_q, and there is a library of documents organised into k clusters (c_1, ..., c_k). Let the total number of documents in the library (in all clusters) be N.

    Now, we have to find a document in the library (k clusters) most similar to the query D_q.
    We do it in two phases:
    Phase I: Perform a similarity analysis of D_q with each of the k clusters and find some number of most similar clusters. Let us denote the set of these clusters as C_s.
    (Now the search space is boiled down to C_s - a small part of the huge library)
    Phase II: Perform a similarity analysis of D_q with each document of each cluster in the set C_s.
    Finally, we have a list of N similarities corresponding to all the N documents in the library.
    The non-zero similarity values are the ones corresponding to the documents of C_s.
    And all the other documents are given zero similarity since they did not even qualify in the first level of analysis.
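
    The two phases above can be sketched in code. This is only an illustration under stated assumptions: documents are toy feature vectors, similarity is cosine similarity, and each cluster is represented by its centroid in Phase I. None of these choices come from the actual system, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(docs):
    """Mean vector of a cluster; stands in for the cluster in Phase I (assumption)."""
    n = len(docs)
    return [sum(col) / n for col in zip(*docs)]

def two_phase_similarities(query, clusters, top_m):
    """Return one similarity per document in the library (N values in total)."""
    # Phase I: rank the k clusters by similarity to the query and keep the
    # top_m of them; this set plays the role of C_s.
    ranked = sorted(
        range(len(clusters)),
        key=lambda j: cosine(query, centroid(clusters[j])),
        reverse=True,
    )
    selected = set(ranked[:top_m])

    # Phase II: score documents only inside the selected clusters;
    # every other document gets similarity zero.
    sims = []
    for j, docs in enumerate(clusters):
        for d in docs:
            sims.append(cosine(query, d) if j in selected else 0.0)
    return sims

# Toy library: k = 3 clusters, N = 5 documents.
clusters = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
    [[0.7, 0.7]],
]
sims = two_phase_similarities([1.0, 0.2], clusters, top_m=1)
```

    Here `sims` has one entry per document in the library, and only the documents of the single selected cluster have non-zero values.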

    This is the way we are doing stuff right now.
    Now what we want is to represent this evaluation as a similarity posterior over all the documents in the library given the clusters forming the library:

    For i = 1 to N: P(D_q ~ i | c_1, c_2, ..., c_k)

    The crux of my problem is here: how can this posterior probability expression be decomposed so that it corresponds to the two-phase procedure explained above? I thought this might be done using Bayes' rule, but I am not sure how.

    I hope I made myself clear; if not, I am ready to explain further.
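
    One possible starting point, offered as a sketch rather than a definitive answer: write d_i for the event that document i is the one sought, and assume each document belongs to exactly one cluster, denoted c_(i). Suppose also that Phase I can be read as a cluster-level probability P(c_j | D_q) and Phase II as a within-cluster probability P(d_i | c_j, D_q). The event d_i, the notation c_(i), and both probabilities are assumptions, not something the current procedure defines. The law of total probability then gives:

```latex
P(d_i \mid D_q)
  = \sum_{j=1}^{k} P(d_i \mid c_j, D_q)\, P(c_j \mid D_q)
  = P(d_i \mid c_{(i)}, D_q)\, P(c_{(i)} \mid D_q)
```

    The sum collapses to one term because P(d_i | c_j, D_q) = 0 whenever document i is not in cluster c_j. Restricting the search to C_s then amounts to setting P(c_j | D_q) = 0 for every cluster outside C_s, which reproduces the zero similarities for documents that do not pass the first phase.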

    Thank you
  5. Aug 24, 2012 #4

    Stephen Tashi

    Science Advisor

    It isn't clear what that means.

    Does "D_q ~ i" refer to a yes-or-no random variable that says document D_q is (or is not) similar to document i? Or does D_q refer to a real number that measures similarity, or to a vector of numbers?

    What event does "c_1, c_2, ..., c_k" denote? I thought the c_i were clusters, not events.