Looking for the most suitable distance for binary clustering

  • #1
Frank Einstein
TL;DR Summary
I have a dataset of people logging into a server and I need to find the most suitable distance measure to cluster them.
Hello everyone.

I have a pandas DataFrame in Python with n+1 columns and t rows. The first column is a timestamp that advances second by second over a time interval, and each of the other columns is named after a person who logs into the server. For each of the t rows, those columns contain a "1" if the person is logged in at that exact second and a "0" if they are not.

I have used hierarchical clustering with the Hamming distance and average linkage.
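For reference, this is roughly the pipeline I have been using (just a minimal sketch; the column names, the toy data, and the cut into two clusters are placeholders, not my real settings):

```python
import pandas as pd
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Placeholder frame standing in for the real data: first column is the
# timestamp, the rest are 0/1 login flags per user (names are made up).
df = pd.DataFrame({
    "timestamp": [0, 1, 2, 3, 4],
    "alice":     [1, 1, 1, 0, 0],
    "bob":       [1, 1, 0, 0, 0],
    "carol":     [0, 0, 1, 1, 1],
})

users = df.drop(columns=df.columns[0]).T          # one row per user, one column per second
dists = pdist(users.values, metric="hamming")     # condensed pairwise Hamming distances
Z = linkage(dists, method="average")              # average-linkage dendrogram
labels = fcluster(Z, t=2, criterion="maxclust")   # cut into e.g. 2 flat clusters
print(dict(zip(users.index, labels)))
```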

However, I am not sure whether the Hamming distance is the most suitable measure for clustering the users, especially after reading this article, which compares 76 distance measures.

I am not an expert in clustering, so I would like to know what others think would be the most suitable distance measure for grouping the users.

As far as I know, both positive and negative matches are important in this case, so perhaps the Sokal-Michener distance might be suitable?
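In case it helps, this is the measure I have in mind, written out explicitly (a small sketch; the example vectors are made up):

```python
import numpy as np

def sokal_michener_similarity(x, y):
    """Simple matching coefficient: positive (1-1) and negative (0-0)
    matches both count as agreement."""
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    a = np.sum(x & y)          # seconds where both are logged in
    d = np.sum(~x & ~y)        # seconds where both are logged out
    return (a + d) / x.size    # similarity in [0, 1]; 1 minus this is a distance

print(sokal_michener_similarity([1, 1, 0, 0], [1, 0, 0, 0]))  # 0.75
```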

Any recommendation is welcome.
Best regards and thanks for reading.
  • #2
I think it would help to start by explaining why you are clustering users. A metric's suitability is defined by what your end objective is.
 

1. What is binary clustering in data analysis?

Binary clustering refers to the process of grouping a set of objects into two clusters based on similarity or distance metrics. This method is used in various fields such as machine learning, statistics, and bioinformatics to simplify data analysis and pattern recognition by dividing data into two distinct groups.

2. How do you determine the most suitable distance for binary clustering?

The most suitable distance for binary clustering depends on the nature of the data and the specific requirements of the analysis. Common distance measures include Euclidean distance, Manhattan distance, and cosine similarity. The choice of distance should reflect the characteristics of the data, such as whether the data is categorical, ordinal, or continuous, and should align with the overall goals of the clustering exercise.
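For example, the following sketch (assuming SciPy is available) computes the three measures named above, plus the Hamming distance, on the same pair of binary vectors; the vectors themselves are arbitrary:

```python
from scipy.spatial.distance import euclidean, cityblock, cosine, hamming

u = [1, 1, 0, 0, 1]
v = [1, 0, 0, 0, 1]

# Each measure assigns a different (dis)similarity to the same pair of vectors.
for name, metric in [("euclidean", euclidean), ("manhattan", cityblock),
                     ("cosine", cosine), ("hamming", hamming)]:
    print(f"{name:>10}: {metric(u, v):.3f}")
```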

3. What are some common algorithms used for binary clustering?

Common algorithms used for binary clustering include K-means clustering, hierarchical clustering, and DBSCAN. K-means is particularly popular for its simplicity and efficiency in processing large datasets, whereas hierarchical clustering is favored for its ability to provide a multi-level hierarchy of clusters, which is useful for detailed analysis. DBSCAN is utilized for its capability in handling noise and identifying outliers effectively.
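As an illustration (a sketch assuming scikit-learn; eps and min_samples are placeholder values, not recommendations), DBSCAN can be run directly on a precomputed distance matrix, which makes it easy to pair with a binary distance such as Hamming:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20, 100))               # 20 toy binary profiles
D = squareform(pdist(X, metric="hamming"))           # square pairwise distance matrix

labels = DBSCAN(eps=0.3, min_samples=3, metric="precomputed").fit_predict(D)
print(labels)                                        # -1 marks points treated as noise
```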

4. How does the choice of distance metric affect the results of binary clustering?

The choice of distance metric significantly impacts the results of binary clustering, and different metrics can lead to different cluster formations. For example, Euclidean distance, which measures the straight-line distance between points, is sensitive to outliers, whereas Manhattan distance, which sums the absolute differences of the coordinates, is more robust in such scenarios. Selecting an inappropriate distance metric can lead to misleading clustering results, so understanding the data characteristics and the implications of each metric is crucial.
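A small numerical sketch of that sensitivity (toy numbers only): two vectors at the same Manhattan distance from the origin can sit at very different Euclidean distances when one of them concentrates its difference in a single coordinate.

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock

origin = np.zeros(3)
spread = np.array([4.0, 4.0, 4.0])    # differences spread evenly
spiked = np.array([1.0, 1.0, 10.0])   # one large outlying difference

for q in (spread, spiked):
    print(f"manhattan={cityblock(origin, q):.1f}  euclidean={euclidean(origin, q):.2f}")
# both have Manhattan distance 12.0, but squaring makes the spiked vector
# noticeably farther under the Euclidean distance (10.10 vs 6.93)
```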

5. What are some challenges in determining the most suitable distance for binary clustering?

Challenges in determining the most suitable distance for binary clustering include handling high-dimensional data, where distance metrics can become less informative (a phenomenon known as the "curse of dimensionality"). Additionally, different types of data (e.g., categorical vs. continuous) may require different distance calculation approaches. Another challenge is the presence of noisy data or outliers, which can distort distance measurements and lead to incorrect clustering. Thus, preprocessing steps such as normalization, dimensionality reduction, and outlier detection are often necessary before applying a clustering algorithm.
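A typical preprocessing pipeline for continuous features might look like the sketch below (assuming scikit-learn; the array, the number of components, and the number of clusters are all placeholders, and binary 0/1 data would normally call for a different treatment):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((100, 50))                              # placeholder continuous data

X_std = StandardScaler().fit_transform(X)              # zero mean, unit variance
X_red = PCA(n_components=10).fit_transform(X_std)      # tame high dimensionality
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X_red)
```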
