A Question about a particular paper on categorical data

  1. Apr 21, 2016 #1
    I am not sure this is the right forum for this -- I have a question about a particular paper:

    http://www-users.cs.umn.edu/~sboriah/PDFs/ChandolaCBK2009.pdf

    The authors describe four heuristics that can be derived from categorical data in order to map it to numerical values: d_m, f_m, n_x, and f_x. They also provide two examples, y and z, and the values of these quantities computed with respect to the dataset in Table 3. I can reproduce their values exactly for d_m and f_m, but I cannot reproduce n_x and f_x.

    Could someone read this paper and try to derive these values? I take their equation (3.3) to be a sum of the reciprocals of the arities of the attributes in the set A_x (i.e. the set of mismatching attributes), but I can't reproduce -5.45 and -7.90.
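For anyone who wants to check the arithmetic, here is a minimal Python sketch of that reading of equation (3.3). It assumes A_x is the set of attribute positions on which two records differ and that n_i is the number of distinct values (arity) of attribute i. The records and arities below are hypothetical placeholders, not the values from Table 3 of the paper.

```python
from fractions import Fraction

def mismatch_set(y, z):
    """Indices of the attributes on which records y and z differ (assumed meaning of A_x)."""
    return [i for i, (a, b) in enumerate(zip(y, z)) if a != b]

def n_x(mismatches, arity):
    """Negative sum of reciprocal arities over the mismatching attributes, per this reading of (3.3)."""
    return -sum(Fraction(1, arity[i]) for i in mismatches)

# Hypothetical records and arities, for illustration only.
y = ("a1", "b2", "c3")
z = ("a1", "b1", "c1")
arity = [2, 4, 5]

A_x = mismatch_set(y, z)
print(A_x)              # [1, 2]
print(n_x(A_x, arity))  # -9/20
```

Under this reading each term has magnitude at most 1, so with four attributes the result cannot fall below -4.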

    Please note I already contacted the authors -- one responded that Dr. Boriah is the person responsible for these calculations but he is apparently not reachable.
     
  2. Apr 23, 2016 #2

    Stephen Tashi

    Science Advisor

    What that paper proposes to do is very interesting, but will understanding it be worth dealing with its problems?

    I can't reproduce those values either.

    It's interesting that ##z = (a_3,b_2,c_{10},d_5)## has an attribute value ##a_3## that does not occur in the "reference" data set. I wonder if that example is supposed to emphasize that you can compute the statistics when such a situation comes up.

    The formula (3.3) ##n_x = -\sum_{i \in A_x} \frac{1}{n_i}## is not consistent with the passage in the article that describes it.
    In the formula, a lower arity would produce a "more negative" contribution, so the statistic would be lower instead of higher.
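The sign behaviour is easy to confirm numerically. The sketch below compares the contribution of a single mismatching attribute under formula (3.3) for two hypothetical arities, 2 and 10, chosen only for illustration.

```python
def contribution(arity):
    """Contribution of one mismatching attribute to n_x under formula (3.3)."""
    return -1.0 / arity

low_arity = contribution(2)    # attribute with 2 possible values
high_arity = contribution(10)  # attribute with 10 possible values

print(low_arity, high_arity)   # -0.5 -0.1
print(low_arity < high_arity)  # True: the lower-arity attribute pulls n_x further down
```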

    The notation in formula (3.4) ##f_x = -\sum_{i \in A_x} \left( \frac{1}{z_i} + \frac{1}{y_i} \right)## doesn't make sense to me because ##z_i## and ##y_i## are values of categories (like "smooth" and "urban"), not numbers.
     