# A Question about a particular paper on categorical data

1. Apr 21, 2016

### ZPlayer

I am not sure this is the right forum for this -- I have a question about a particular paper:

http://www-users.cs.umn.edu/~sboriah/PDFs/ChandolaCBK2009.pdf

The authors describe four heuristics that can be derived from categorical data in order to map categorical data to numerical values: d_m, f_m, n_x, and f_x. They also provide two examples, y and z, along with the values of these quantities computed with respect to the dataset in Table 3. I can reproduce their values exactly for d_m and f_m, but I cannot reproduce n_x and f_x.

Could someone read this paper and try to derive these values? As I read it, their equation (3.3) is a sum of reciprocals of arity over the set A_x (i.e. the set of mismatching attributes), but with that reading I can't reproduce -5.45 and -7.90.

Please note I already contacted the authors -- one responded that Dr. Boriah is the person responsible for these calculations but he is apparently not reachable.

2. Apr 23, 2016

### Stephen Tashi

What that paper proposes to do is very interesting, but will understanding it be worth dealing with its problems?

I can't reproduce those values either.

It's interesting that $z = (a_3,b_2,c_{10},d_5)$ has attribute $a_3$ that does not occur in the "reference" data set. I wonder if that example is supposed to emphasize that you can compute the statistics when such a situation comes up.

The formula (3.3) $n_x = -\sum_{i \in A_x} \frac{1}{n_i}$ is not consistent with the passage in the article that says:
In the formula, lower arity would produce a "more negative" contribution, so the statistic would be lower instead of higher.
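
To make the sign issue concrete, here is a minimal Python sketch of formula (3.3) exactly as written above. The function name `n_x` and the arity values are made up for illustration; they are not taken from the paper's Table 3.

```python
def n_x(arities, mismatching):
    """Formula (3.3) as quoted: minus the sum of 1/n_i over the
    mismatching attributes A_x, where n_i is the arity of attribute i."""
    return -sum(1.0 / arities[i] for i in mismatching)


# Hypothetical arities (NOT the paper's data): two attributes, both mismatching.
low_arity = {"A": 2, "B": 2}
high_arity = {"A": 10, "B": 10}

low = n_x(low_arity, ["A", "B"])    # -(1/2 + 1/2)
high = n_x(high_arity, ["A", "B"])  # -(1/10 + 1/10)

# Lower arity gives the more negative (i.e. lower) value.
print(low, high, low < high)
```

With these made-up numbers the low-arity case comes out below the high-arity case, which is the behaviour described in the sentence above.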

The notation in formula (3.4) $f_x = -\sum_{i \in A_x} \left( \frac{1}{z_i} + \frac{1}{y_i} \right)$ doesn't make sense to me because $z_i$ and $y_i$ are values of categories (like "smooth" and "urban"), not numbers.
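
One guess, and it is only a guess, is that $\frac{1}{z_i}$ is meant as the reciprocal of how often the value $z_i$ occurs for attribute $i$ in the reference data set. The sketch below implements that reading. The function name `f_x`, the toy data, and the interpretation itself are all assumptions of mine, not something the paper confirms.

```python
from collections import Counter


def f_x(data, y, z):
    """Formula (3.4) under an ASSUMED reading: 1/z_i and 1/y_i are
    reciprocals of the counts of those values in column i of `data`."""
    total = 0.0
    for i, (yi, zi) in enumerate(zip(y, z)):
        if yi == zi:
            continue  # only mismatching attributes (the set A_x) contribute
        counts = Counter(row[i] for row in data)
        for value in (yi, zi):
            if counts[value] == 0:
                # A value absent from the reference data has no reciprocal.
                raise ValueError(f"value {value!r} does not occur in attribute {i}")
            total += 1.0 / counts[value]
    return -total


# Hypothetical reference data (NOT the paper's Table 3).
data = [("a1", "b1"), ("a1", "b2"), ("a2", "b2"), ("a2", "b2")]

result = f_x(data, ("a1", "b2"), ("a2", "b2"))  # -(1/2 + 1/2)
print(result)
```

Under this reading a value that never occurs in the reference data, such as $a_3$ in $z$, has a count of zero and the term is undefined, so the sketch raises an error rather than guessing how the paper would handle it.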