Looking for the most suitable distance for binary clustering

  • #1
Frank Einstein
TL;DR Summary
I have a dataset of people logging into a server and I need to find the most suitable distance measure to cluster them.
Hello everyone.

I have a pandas DataFrame in Python with n+1 columns and t rows. The first column is a timestamp that advances second by second over a time interval, and each of the other columns is named after a person who logs into the server. For each of the t rows, those columns contain a "1" if the person is logged in at that exact second and a "0" if they are not.

I have used hierarchical clustering with the Hamming distance and average linkage.
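For reference, this is roughly the pipeline I have been using (just a minimal sketch; the column names, the toy data, and the cut into two clusters are placeholders, not my real settings):

```python
import pandas as pd
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Placeholder frame standing in for the real data: first column is the
# timestamp, the rest are 0/1 login flags per user (names are made up).
df = pd.DataFrame({
    "timestamp": [0, 1, 2, 3, 4],
    "alice":     [1, 1, 1, 0, 0],
    "bob":       [1, 1, 0, 0, 0],
    "carol":     [0, 0, 1, 1, 1],
})

users = df.drop(columns=df.columns[0]).T          # one row per user, one column per second
dists = pdist(users.values, metric="hamming")     # condensed pairwise Hamming distances
Z = linkage(dists, method="average")              # average-linkage dendrogram
labels = fcluster(Z, t=2, criterion="maxclust")   # cut into e.g. 2 flat clusters
print(dict(zip(users.index, labels)))
```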

However, I am not sure whether the Hamming distance is the most suitable measure for clustering the users, especially after reading this article, which compares 76 distance measures.

I am not an expert in clustering, so I would like to know what others think would be the most suitable distance measure for grouping the users.

As far as I know, both positive and negative matches are important in this case, so perhaps the Sokal-Michener distance might be suitable?
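In case it helps, this is the measure I have in mind, written out explicitly (a small sketch; the example vectors are made up):

```python
import numpy as np

def sokal_michener_similarity(x, y):
    """Simple matching coefficient: positive (1-1) and negative (0-0)
    matches both count as agreement."""
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    a = np.sum(x & y)          # seconds where both are logged in
    d = np.sum(~x & ~y)        # seconds where both are logged out
    return (a + d) / x.size    # similarity in [0, 1]; 1 minus this is a distance

print(sokal_michener_similarity([1, 1, 0, 0], [1, 0, 0, 0]))  # 0.75
```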

Any recommendation is welcome.
Best regards and thanks for reading.
  • #2
I think it would help to start by explaining why you are clustering users. A metric's suitability is defined by what your end objective is.
 

1. What is binary clustering in data analysis?

Binary clustering refers to the process of grouping a set of objects into two clusters based on similarity or distance metrics. This method is used in various fields such as machine learning, statistics, and bioinformatics to simplify data analysis and pattern recognition by dividing data into two distinct groups.

2. How do you determine the most suitable distance for binary clustering?

The most suitable distance for binary clustering depends on the nature of the data and the specific requirements of the analysis. Common distance measures include Euclidean distance, Manhattan distance, and cosine similarity. The choice of distance should reflect the characteristics of the data, such as whether the data is categorical, ordinal, or continuous, and should align with the overall goals of the clustering exercise.
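For example, the following sketch (assuming SciPy is available) computes the three measures named above, plus the Hamming distance, on the same pair of binary vectors; the vectors themselves are arbitrary:

```python
from scipy.spatial.distance import euclidean, cityblock, cosine, hamming

u = [1, 1, 0, 0, 1]
v = [1, 0, 0, 0, 1]

# Each measure assigns a different (dis)similarity to the same pair of vectors.
for name, metric in [("euclidean", euclidean), ("manhattan", cityblock),
                     ("cosine", cosine), ("hamming", hamming)]:
    print(f"{name:>10}: {metric(u, v):.3f}")
```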

3. What are some common algorithms used for binary clustering?

Common algorithms used for binary clustering include K-means clustering, hierarchical clustering, and DBSCAN. K-means is particularly popular for its simplicity and efficiency in processing large datasets, whereas hierarchical clustering is favored for its ability to provide a multi-level hierarchy of clusters, which is useful for detailed analysis. DBSCAN is utilized for its capability in handling noise and identifying outliers effectively.
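As an illustration (a sketch assuming scikit-learn; eps and min_samples are placeholder values, not recommendations), DBSCAN can be run directly on a precomputed distance matrix, which makes it easy to pair with a binary distance such as Hamming:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20, 100))               # 20 toy binary profiles
D = squareform(pdist(X, metric="hamming"))           # square pairwise distance matrix

labels = DBSCAN(eps=0.3, min_samples=3, metric="precomputed").fit_predict(D)
print(labels)                                        # -1 marks points treated as noise
```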

4. How does the choice of distance metric affect the results of binary clustering?

The choice of distance metric significantly impacts the results of binary clustering, and different metrics can lead to different cluster formations. For example, Euclidean distance, which measures the straight-line distance between points, is sensitive to outliers, whereas Manhattan distance, which sums the absolute differences of the coordinates, is more robust in such scenarios. Selecting an inappropriate distance metric can lead to misleading clustering results, so understanding the data characteristics and the implications of each metric is crucial.
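A small numerical sketch of that sensitivity (toy numbers only): two vectors at the same Manhattan distance from the origin can sit at very different Euclidean distances when one of them concentrates its difference in a single coordinate.

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock

origin = np.zeros(3)
spread = np.array([4.0, 4.0, 4.0])    # differences spread evenly
spiked = np.array([1.0, 1.0, 10.0])   # one large outlying difference

for q in (spread, spiked):
    print(f"manhattan={cityblock(origin, q):.1f}  euclidean={euclidean(origin, q):.2f}")
# both have Manhattan distance 12.0, but squaring makes the spiked vector
# noticeably farther under the Euclidean distance (10.10 vs 6.93)
```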

5. What are some challenges in determining the most suitable distance for binary clustering?

Challenges in determining the most suitable distance for binary clustering include handling high-dimensional data, where distance metrics can become less informative (a phenomenon known as the "curse of dimensionality"). Additionally, different types of data (e.g., categorical vs. continuous) may require different distance calculation approaches. Another challenge is the presence of noisy data or outliers, which can distort distance measurements and lead to incorrect clustering. Thus, preprocessing steps such as normalization, dimensionality reduction, and outlier detection are often necessary before applying a clustering algorithm.
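A typical preprocessing pipeline for continuous features might look like the sketch below (assuming scikit-learn; the array, the number of components, and the number of clusters are all placeholders, and binary 0/1 data would normally call for a different treatment):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((100, 50))                              # placeholder continuous data

X_std = StandardScaler().fit_transform(X)              # zero mean, unit variance
X_red = PCA(n_components=10).fit_transform(X_std)      # tame high dimensionality
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X_red)
```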
