Looking for the most suitable distance for binary clustering

Frank Einstein · Dec 20, 2023

Hello everyone.

I have a pandas dataset in python which has n+1 columns and t rows. The first column is a timestamp that goes second by second during a time interval, and the other columns are the names of the people who log in the server. The t rows of the other columns indicate if the person is logged with an "1" and a "0" if the person isn't logged in the exact second.

I have used a Hierarchical clustering with Hamming distance and linkage average.

However, I am not sure if the Hamming distance is the most suitable measure to calculate the clustering between the users, specially after reading this article in which a comparison between 76 distances is defined.

I am not an expert in clustering, so I would like to know what other people think that would be the most adequate distance measure to group the users.

As far as I know, positive and negative matches are important in this case, so the Sokal Michenner distance might be suitable?

Any recomendation is welcome.
Best regards an thanks for reading.

Office_Shredder · Jan 1, 2024

I think it would help to start by explaining why you are clustering users. A metric's suitability is defined by what your end objective is.

Looking for the most suitable distance for binary clustering

Similar threads

Undergrad Please Explain (actually explain) The Monty Hall Problem

Undergrad A variant of the Monty Hall problem

Graduate Hypothesis testing: Defining H0, HA hypotheses so that ( H_A)_A' makes sense

Undergrad My basic understanding of set theory

Undergrad How do E[X] and E[|X|] relate?

Insights Remote Operated Gate Control System

Insights AI Enriched Problem Solving

Insights Thinking Outside The Box Versus Knowing What’s In The Box

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect

Insights What Exactly is Dirac’s Delta Function? - Insight