Question about a particular paper on categorical data

SUMMARY

The forum discussion centers on a paper by Chandola et al. (linked in the first post), specifically the four heuristics d_m, f_m, n_x, and f_x used to map categorical data to numerical values. The original poster can match the paper's values for d_m and f_m but cannot reproduce those for n_x and f_x, citing -5.45 and -7.90 as the values that will not come out. A reply questions equations (3.3) and (3.4): the sign in (3.3) contradicts the paper's own description of how n_x depends on arity, and (3.4) takes reciprocals of categorical values, which are not numbers.

PREREQUISITES
  • Understanding of categorical data and its conversion to numerical values.
  • Familiarity with statistical heuristics and their applications in data analysis.
  • Knowledge of arity in the context of data attributes.
  • Ability to interpret mathematical notation and formulae in research papers.
NEXT STEPS
  • Review the paper by Chandola et al. linked in the thread, in particular equations (3.3) and (3.4) and Table 3.
  • Explore the concept of arity in categorical data and its implications for statistical analysis.
  • Investigate common issues in reproducing statistical results from research papers.
  • Learn about alternative methods for mapping categorical data to numerical values.
USEFUL FOR

Researchers, data scientists, and statisticians interested in categorical data analysis, as well as anyone looking to understand the complexities of reproducing statistical results from academic papers.

ZPlayer
I am not sure this is the right forum for this -- I have a question about a particular paper:

http://www-users.cs.umn.edu/~sboriah/PDFs/ChandolaCBK2009.pdf

The authors describe four heuristics that can be derived from categorical data in order to map it to numerical values: d_m, f_m, n_x, and f_x. They also provide two example instances, y and z, and the values of these quantities computed with respect to the data set in Table 3. I can match their values exactly for d_m and f_m, but I cannot reproduce n_x and f_x.

Could someone read this paper and try to derive these values? I take their equation (3.3) to be a sum of reciprocals of arity over the set A_x (i.e. the set of mismatching attributes), but I can't reproduce -5.45 and -7.90.

Please note that I have already contacted the authors. One responded that Dr. Boriah is the person responsible for these calculations, but he is apparently not reachable.
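
To make that reading of (3.3) concrete, here is a minimal sketch. It rests on assumptions that the thread does not confirm: A_x is taken pairwise (attributes where an instance and one reference row differ), arity is taken as the number of distinct values observed in the reference set, and the toy data is hypothetical, not the paper's Table 3. The function names are illustrative only.

```python
def mismatching_attributes(u, v):
    """Indices of attributes where two instances take different values."""
    return [i for i, (a, b) in enumerate(zip(u, v)) if a != b]


def n_x(u, v, arities):
    """Literal reading of (3.3): minus the sum of 1/arity over mismatches."""
    return -sum(1.0 / arities[i] for i in mismatching_attributes(u, v))


# Hypothetical toy reference set, NOT the paper's Table 3.
reference = [
    ("a1", "b1", "c1"),
    ("a2", "b1", "c2"),
    ("a1", "b2", "c3"),
]

# Assumption: arity = number of distinct values seen per attribute.
arities = [len({row[i] for row in reference}) for i in range(3)]

y = ("a1", "b2", "c1")
values = [n_x(y, row, arities) for row in reference]
print(arities)
print(values)
```

Under this reading each mismatching attribute contributes at most 1 in magnitude, so a single pair of four-attribute instances (such as the z quoted later in the thread) cannot reach -5.45. If the literal formula is right, the reported values would have to involve some aggregation over the reference set, and the thread does not say which.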
 
ZPlayer said:
I am not sure this is the right forum for this -- I have a question about a particular paper:

http://www-users.cs.umn.edu/~sboriah/PDFs/ChandolaCBK2009.pdf

What that paper proposes to do is very interesting, but will understanding it be worth dealing with its problems?

I cannot reproduce n_x and f_x.

I can't either.

It's interesting that ##z = (a_3,b_2,c_{10},d_5)## has the attribute value ##a_3##, which does not occur in the "reference" data set. I wonder if that example is meant to emphasize that the statistics can still be computed when such a situation comes up.

Formula (3.3), ##n_x = -\sum_{i \in A_x} \frac{1}{n_i}##, is not consistent with the passage in the article that says:
"The statistic ##n_x## is a function of the arity of the mismatching attributes between an instance and a reference data set. In particular, the value of the statistic is higher when the mismatching attributes have lower arity, i.e. they take fewer values."

In the formula, lower arity would produce a "more negative" contribution, so the statistic would be lower instead of higher.
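
A quick numerical check of that sign argument, as a minimal sketch: it evaluates the per-attribute term ##-1/n_i## from (3.3) as written for a few arities. The arities chosen are arbitrary illustrations, not values from the paper.

```python
def contribution(arity):
    """Per-attribute term of (3.3) as written: -1 / n_i."""
    return -1.0 / arity


# Arbitrary example arities, chosen only to show the trend.
for arity in (2, 5, 10):
    print(f"arity {arity:2d}: contribution {contribution(arity):+.2f}")
```

The term for arity 2 is the most negative of the three, so with the minus sign in place, mismatches on low-arity attributes pull the statistic down, which is the opposite of what the quoted passage describes.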

The notation in formula (3.4), ##f_x = -\sum_{i \in A_x} \left( \frac{1}{z_i} + \frac{1}{y_i} \right)##, doesn't make sense to me because ##z_i## and ##y_i## are category values (like "smooth" and "urban"), not numbers.
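
One way to make (3.4) at least well-typed, offered purely as a guess that neither the paper excerpted here nor this thread confirms, is to read ##1/z_i## as the reciprocal of how often the value ##z_i## occurs in the reference set. The sketch below implements only that guess, on hypothetical toy data; `f_x_guess` and `value_counts` are made-up names.

```python
from collections import Counter


def value_counts(reference):
    """Per-attribute counts of each categorical value in the reference set."""
    n_attrs = len(reference[0])
    return [Counter(row[i] for row in reference) for i in range(n_attrs)]


def f_x_guess(z, y, counts):
    """GUESSED reading of (3.4): 1/z_i means 1 / (count of value z_i)."""
    total = 0.0
    for i, (zi, yi) in enumerate(zip(z, y)):
        if zi == yi:
            continue  # only mismatching attributes are in A_x
        for value in (zi, yi):
            count = counts[i][value]  # Counter gives 0 for unseen values
            if count == 0:
                raise ValueError(f"value {value!r} never occurs in the reference set")
            total += 1.0 / count
    return -total


# Hypothetical toy reference set, NOT the paper's Table 3.
reference = [("a1", "b1"), ("a1", "b2"), ("a2", "b2")]
counts = value_counts(reference)
print(f_x_guess(("a1", "b1"), ("a2", "b1"), counts))
```

Under this guess, a value that never occurs in the reference set, like ##a_3## in ##z##, has count zero and the reciprocal is undefined, so the sketch raises an error rather than returning a number.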
 
