Rmsd-based clustering, cluster properties

  • Thread starter noplacebos
In summary: The conversation discussed the shape and properties of clusters formed using Root Mean Square Deviation (RMSD) as a distance metric. Every member lies within the RMSD threshold of its cluster's representative, so a cluster is contained in a ball of that radius, but its actual size and shape depend on the distribution and spread of the data points. Because RMSD obeys the triangle inequality, any two members of a cluster are closer than twice the threshold; this is an upper bound, not a typical value. Multidimensional Scaling can be used to visualize the clusters in 2 or 3 dimensions, but whether they look globular there depends on the data and on how much structure is lost in the reduction. It is important to consider all of these factors when interpreting results.
  • #1
noplacebos
Greetings,

I have a quick question that could be trivial, but I have been scratching my head for weeks now without being able to find anything concrete in books, papers, or on the web.

I have completed a partitional clustering of a dataset (vectors) using Root Mean Square Deviation as my distance metric. Leaving all other details on the clustering method aside, data points were being assigned to a cluster if their RMSD to the cluster's representative was below a predefined threshold. Let's say that this threshold is 1.

My question is about the resulting clusters' shape and the general properties of RMSD as a proximity function. Are my clusters spherical (globular) because of this RMSD threshold, and do they actually have a radius of 1? Can I assume by default that any two data points within a cluster will have a pairwise RMSD of less than 2 (the diameter)?

I am telling myself that, as RMSD doesn't exactly reflect the sum of Euclidean distances between two vectors, the true cluster's shape may lie in some multidimensional space. If I apply Multidimensional Scaling down to 3 or 2 dimensions, should I expect a globular cluster shape this time? Does it depend on the nature of the initial vectors (e.g., the number of parameters)?
Please excuse all these questions, but I remain very confused on this matter. Any pointer in the right direction would be most helpful.

Thank you very much for your time.
 
  • #2




Thank you for your question regarding the shape and properties of clusters formed using Root Mean Square Deviation (RMSD) as a distance metric. I understand your confusion and would be happy to provide some insights and direction on this matter.

Firstly, the shape of clusters formed with any distance metric depends on the nature of the data and the clustering algorithm used. In your case, every data point in a cluster has an RMSD of less than 1 to the cluster's representative, so each cluster is contained in a ball of RMSD radius 1 centered on that representative. For plain vectors of length n, RMSD is the Euclidean distance divided by the square root of n, so this ball is an ordinary hypersphere of Euclidean radius sqrt(n). That does not mean the cluster fills the ball or looks spherical. The threshold only sets an upper bound: the actual size and shape depend on the distribution and spread of your data points, and a tight or elongated group of points will give a cluster that is smaller than the ball or far from round.

As for the pairwise RMSD within a cluster, it is correct that any two data points assigned to the same cluster have an RMSD of less than 2. RMSD between vectors obeys the triangle inequality, so the distance between two members is at most the sum of their distances to the representative, each of which is below 1. This is only an upper bound, not a statement about shape: most pairs will typically be much closer than 2, and the members are not equidistant from one another. Note also that the pairwise RMSD is a distance between two specific points, not an average over the cluster; the averaging in its name is over the coordinates of the two vectors.
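To make the bound on pairwise RMSD concrete, here is a minimal NumPy sketch. It assumes RMSD is computed directly on equal-length vectors (no superposition or alignment step), and the vector length, random data, and helper name `rmsd` are made up for illustration; they are not taken from the original post.

```python
import numpy as np

def rmsd(x, y):
    """RMSD between two equal-length vectors, computed directly on coordinates."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sqrt(np.mean((x - y) ** 2))

rng = np.random.default_rng(0)
n_dims = 10        # hypothetical vector length
threshold = 1.0    # the cutoff used in the question

# A hypothetical cluster representative and candidate points scattered around it.
representative = rng.normal(size=n_dims)
candidates = representative + rng.normal(scale=0.8, size=(500, n_dims))

# Keep only the points that the threshold rule would assign to this cluster.
members = np.array([p for p in candidates if rmsd(p, representative) < threshold])

# RMSD is the Euclidean distance scaled by 1/sqrt(n).
a, b = members[0], members[1]
assert np.isclose(rmsd(a, b), np.linalg.norm(a - b) / np.sqrt(n_dims))

# All pairwise RMSDs inside the cluster, via broadcasting.
diffs = members[:, None, :] - members[None, :, :]
pairwise = np.sqrt(np.mean(diffs ** 2, axis=-1))

print("members kept:", len(members))
print("largest pairwise RMSD:", pairwise.max())
print("bound from the triangle inequality:", 2 * threshold)
```

The largest pairwise value stays below 2, but how far below depends entirely on how the points are spread, which is the sense in which the threshold bounds the cluster without fixing its shape.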

In terms of using Multidimensional Scaling (MDS) to visualize the clusters in a lower dimension, MDS looks for a low-dimensional arrangement of points whose distances match the pairwise RMSD values as closely as possible. Whether the clusters then look globular depends on the nature of your data and on the number of dimensions you reduce to. If the vectors have many parameters, squeezing them into 2 or 3 dimensions discards part of the distance information, so the picture is an approximation and may not accurately represent the true shape of the clusters in the original space.
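As a rough illustration of what such a reduction does, the sketch below implements classical (Torgerson) MDS with NumPy on a pairwise RMSD matrix. This is one particular MDS variant chosen for simplicity; iterative, stress-minimizing MDS implementations behave differently. The data are random and the function names are hypothetical.

```python
import numpy as np

def pairwise_rmsd(points):
    """Matrix of RMSD values between all pairs of row vectors."""
    diffs = points[:, None, :] - points[None, :, :]
    return np.sqrt(np.mean(diffs ** 2, axis=-1))

def classical_mds(dist, n_components=2):
    """Classical MDS: embed points so that distances approximate `dist`."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared distances to obtain a Gram (inner product) matrix.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    # Keep the directions with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)

rng = np.random.default_rng(1)
points = rng.normal(size=(200, 10))   # hypothetical 10-parameter vectors
dist = pairwise_rmsd(points)
embedding = classical_mds(dist, n_components=2)

# Distances between the embedded 2-D points.
emb_diffs = embedding[:, None, :] - embedding[None, :, :]
embedded_dist = np.sqrt((emb_diffs ** 2).sum(axis=-1))

# Fraction of the original pairwise distance retained, on average.
mask = dist > 0
print("mean retained fraction:", (embedded_dist[mask] / dist[mask]).mean())
```

Because RMSD on plain vectors is a scaled Euclidean distance, classical MDS here amounts to a projection onto the leading principal axes, so embedded distances can only shrink relative to the original RMSD values. A cluster that fits inside an RMSD ball of radius 1 will still fit inside a disc of radius 1 in the plot, but how round it looks depends on the data.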

In conclusion, the shape and properties of your clusters will depend on various factors such as the distance metric, clustering algorithm, and the nature of your data. It is important to carefully consider these factors when interpreting your results. I hope this helps to clarify your confusion and I wish you all the best with your research.


 

1. What is RMSD-based clustering?

RMSD-based clustering is a method used in data analysis to group similar data points together based on their root mean square deviation (RMSD) values. It is commonly used in structural biology to cluster protein structures based on their structural similarity.

2. How does RMSD-based clustering work?

RMSD-based clustering works by calculating the RMSD values between each pair of data points and then grouping them into clusters based on their similarity. The RMSD values are a measure of the structural differences between two data points, so data points with lower RMSD values are considered more similar and are grouped together.
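One simple way to group points by an RMSD cutoff is a "leader"-style pass, sketched below. This is an assumption made for illustration, not necessarily the method used in the thread above: the first point becomes a representative, each later point joins the first cluster whose representative is within the cutoff, and otherwise starts a new cluster. The function names and test data are hypothetical.

```python
import numpy as np

def rmsd(x, y):
    """RMSD between two equal-length vectors, computed directly on coordinates."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sqrt(np.mean((x - y) ** 2)))

def leader_cluster(points, threshold):
    """Assign each point to the first representative within `threshold`."""
    representatives = []   # indices of points acting as cluster representatives
    labels = []            # cluster label for every point, in input order
    for i, p in enumerate(points):
        for label, rep_idx in enumerate(representatives):
            if rmsd(p, points[rep_idx]) < threshold:
                labels.append(label)
                break
        else:
            # No existing representative is close enough: start a new cluster.
            representatives.append(i)
            labels.append(len(representatives) - 1)
    return np.array(labels), representatives

rng = np.random.default_rng(2)
# Two hypothetical, well-separated groups of 5-parameter vectors.
centers = np.array([[0.0] * 5, [5.0] * 5])
points = np.vstack([c + rng.normal(scale=0.3, size=(50, 5)) for c in centers])

labels, reps = leader_cluster(points, threshold=1.0)
print("number of clusters:", len(reps))
print("cluster sizes:", np.bincount(labels))
```

With this rule the result depends on the order of the points and on the cutoff, which is one reason the choice of RMSD cutoff matters so much in practice.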

3. What are some properties of clusters in RMSD-based clustering?

Clusters in RMSD-based clustering can vary in size, shape, and density. The size of a cluster is determined by the number of data points it contains, while the shape and density of a cluster are determined by the distribution of data points within it.

4. What are the advantages of using RMSD-based clustering?

With a fixed cutoff, RMSD-based clustering gives clusters that are easy to interpret: every member lies within a known RMSD of its cluster's representative. It can also reveal sub-groups or outliers within data sets, for example points that end up in very small clusters of their own.

5. What are some potential limitations of RMSD-based clustering?

One limitation of RMSD-based clustering is that it may not be suitable for all types of data: it is primarily used for structural data, and it needs vectors of equal length whose coordinates are comparable. Depending on the algorithm, it also requires a parameter chosen in advance, such as an RMSD cutoff or the number of clusters, which may not always be known. Additionally, the results can change substantially with the choice of RMSD cutoff.
