Looking for the most suitable distance for binary clustering

SUMMARY

The discussion focuses on selecting the most appropriate distance measure for clustering binary user login data held in a pandas dataset in Python. The user has implemented hierarchical clustering with Hamming distance and average linkage but is uncertain about its effectiveness. They ask about alternative distance measures, specifically mentioning the Sokal-Michener distance as a potential option. The conversation emphasizes that the choice of metric should align with the clustering objective.

PREREQUISITES
  • Understanding of Hierarchical clustering techniques
  • Familiarity with distance metrics, specifically Hamming and Sokal-Michener distances
  • Proficiency in Python and the pandas library for data manipulation
  • Knowledge of clustering objectives and their impact on metric selection
NEXT STEPS
  • Research alternative distance measures for binary data, such as Jaccard and Cosine similarity
  • Explore the implementation of clustering algorithms in Python using the scikit-learn library
  • Learn about the impact of clustering objectives on distance metric selection
  • Investigate the performance of different linkage methods in Hierarchical clustering
USEFUL FOR

Data scientists, machine learning practitioners, and anyone involved in user behavior analysis and clustering techniques will benefit from this discussion.

Frank Einstein
TL;DR
I have a dataset of people logging in to a server, and I must find the most adequate distance to cluster them.
Hello everyone.

I have a pandas dataset in Python with n+1 columns and t rows. The first column is a timestamp that advances second by second over a time interval, and the other n columns are named after the people who log in to the server. Each of the t rows in those columns holds a "1" if the person is logged in during that exact second and a "0" if not.

I have used hierarchical clustering with Hamming distance and average linkage.
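For readers who want to reproduce the setup described above, here is a minimal sketch using SciPy. The toy data, the user names, and the choice of two clusters are all assumptions made for illustration; they are not from the original dataset.

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical toy data: one timestamp column plus one 0/1 column per user.
df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=8,
                               freq=pd.Timedelta(seconds=1)),
    "alice": [1, 1, 1, 1, 0, 0, 0, 0],
    "bob":   [1, 1, 1, 0, 0, 0, 0, 0],
    "carol": [0, 0, 0, 0, 1, 1, 1, 1],
    "dave":  [0, 0, 0, 0, 0, 1, 1, 1],
})

# Users are the objects to cluster, so each user becomes one row.
users = df.drop(columns="timestamp")
X = users.T.to_numpy(dtype=bool)

# Pairwise Hamming distance: fraction of seconds in which two users differ.
dist = pdist(X, metric="hamming")

# Average-linkage hierarchical clustering on the condensed distance matrix.
Z = linkage(dist, method="average")

# Cut the tree into (at most) two clusters; the number 2 is an assumption.
labels = fcluster(Z, t=2, criterion="maxclust")
clusters = dict(zip(users.columns, labels))
print(clusters)
```

Swapping the `metric` argument of `pdist` (for example to `"jaccard"`) is enough to try another distance without touching the rest of the pipeline.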

However, I am not sure whether the Hamming distance is the most suitable measure for clustering the users, especially after reading this article, which compares 76 distances.

I am not an expert in clustering, so I would like to know which distance measure other people think would be the most adequate for grouping the users.

As far as I know, both positive and negative matches are important in this case, so might the Sokal-Michener distance be suitable?
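A small sketch may help frame this question. It assumes the usual textbook definition of the Sokal-Michener coefficient as the simple matching coefficient, (a + d) / (a + b + c + d), where a counts seconds in which both users are logged in and d counts seconds in which neither is. Under that assumption, the corresponding distance is the same number as the normalized Hamming distance, so switching between the two would not change the clustering. Some libraries use a different formula under the same name, so check the one you call. The helper functions and the example vectors below are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import hamming

def match_counts(u, v):
    """Return (a, b, c, d): 1-1, 1-0, 0-1 and 0-0 counts for two binary vectors."""
    u = np.asarray(u, dtype=bool)
    v = np.asarray(v, dtype=bool)
    a = int(np.sum(u & v))
    b = int(np.sum(u & ~v))
    c = int(np.sum(~u & v))
    d = int(np.sum(~u & ~v))
    return a, b, c, d

def simple_matching_distance(u, v):
    """1 - (a + d) / n: counts positive AND negative matches as agreement."""
    a, b, c, d = match_counts(u, v)
    return (b + c) / (a + b + c + d)

def jaccard_distance(u, v):
    """1 - a / (a + b + c): ignores negative matches (both logged out)."""
    a, b, c, _ = match_counts(u, v)
    total = a + b + c
    return (b + c) / total if total else 0.0

# Two hypothetical users who are rarely online and never at the same time.
u = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
v = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

smd = simple_matching_distance(u, v)
ham = hamming(u, v)           # SciPy's Hamming is already normalized by length
jac = jaccard_distance(u, v)
print(smd, ham, jac)
```

In this example the matching-based distance is small because the two users are both logged out most of the time, while the Jaccard distance is at its maximum because they never overlap. Which behaviour is preferable depends on whether shared absence should count as similarity.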

Any recommendation is welcome.
Best regards and thanks for reading.
 
I think it would help to start by explaining why you are clustering users. A metric's suitability is defined by what your end objective is.
 
