Looking for advice in clusterization

  • Thread starter Frank Einstein
  • Tags
    Time series
  • #1
Frank Einstein
TL;DR Summary
I need to know how to cluster data measured at different time instants.
Hello everyone. I have a machine with a series of sensors. All sensors send a signal each minute. I want to know if any of those sensors are redundant. The data is available as an Excel file, where the columns are the variables and the rows are the measurements. I have 1000 rows.

To do this, I have used DBSCAN in Python as follows:

Data clusterization:
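from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# data: the measurements read from the Excel file (rows = time, columns = sensors)
scaler = StandardScaler()
data_normalized = scaler.fit_transform(data)  # standardize each sensor column
data_normalized = data_normalized.T  # transpose so that each row is one sensor
dbscan = DBSCAN(eps=15, min_samples=2)
clusters = dbscan.fit_predict(data_normalized)  # one cluster label per sensor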

However, I think that there has to be a better way to find relationships between variables (each sensor, i.e., each column of the data file).

Could someone please point me towards a methodology more suitable for my goals?
Any answer is appreciated.
Thanks for reading.
Best regards.
Frank.
 
  • #2
Dale
You can just look at the correlation matrix. If two inputs are highly correlated then you can probably drop one.
 
  • #3
Dale said:
You can just look at the correlation matrix. If two inputs are highly correlated then you can probably drop one.
Thanks. I can calculate them with ease as well.
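
The correlation-matrix suggestion can be sketched as follows. This is a minimal illustration, not code from the thread: the helper `redundant_pairs`, the 0.95 threshold, and the synthetic sensors `s1`, `s2`, `s3` are all assumptions made for the example, and a NumPy array stands in for the table read from the Excel file.

```python
import numpy as np

def redundant_pairs(data, names, threshold=0.95):
    """Return (name_a, name_b, r) for column pairs with |r| >= threshold."""
    corr = np.corrcoef(data, rowvar=False)  # columns are the variables
    pairs = []
    n_vars = corr.shape[0]
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Synthetic stand-in for the 1000-row sensor table (hypothetical data).
rng = np.random.default_rng(0)
s1 = rng.normal(size=1000)
s2 = 2.0 * s1 + 0.05 * rng.normal(size=1000)  # nearly a rescaled copy of s1
s3 = rng.normal(size=1000)                    # unrelated sensor
data = np.column_stack([s1, s2, s3])

pairs = redundant_pairs(data, ["s1", "s2", "s3"])
print(pairs)

# Link to the DBSCAN attempt: for columns standardized to zero mean and unit
# (population) variance, the squared Euclidean distance between two columns
# equals 2 * n * (1 - r), where n is the number of rows.
z = (data - data.mean(axis=0)) / data.std(axis=0)
d2 = np.sum((z[:, 0] - z[:, 1]) ** 2)
r12 = np.corrcoef(data, rowvar=False)[0, 1]
identity_holds = bool(np.isclose(d2, 2 * len(data) * (1 - r12)))
print(identity_holds)
```

The final check shows why the two approaches are closely related. With 1000 rows, a DBSCAN radius of eps=15 on the standardized, transposed data links two sensors directly only when 225 >= 2000(1 - r), that is, when their correlation is at least about 0.89. Unlike the absolute-value threshold used in the sketch, that distance treats strongly negative correlations as far apart.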
 

1. What is clusterization in data analysis?

Clusterization, or clustering, is the process of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups. It's a common technique in data analysis used to discover structure within data, identify patterns, and simplify complex datasets by grouping similar items together.

2. What are the most common algorithms used for clustering?

The most common clustering algorithms include K-means clustering, Hierarchical clustering, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), and Gaussian Mixture Models. Each algorithm has its own strengths and is suitable for different types of data and specific requirements of the analysis.
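
As a rough illustration, the four algorithms named above can be run side by side with scikit-learn. The toy data, the blob centers, and every parameter value here (`n_clusters=3`, `eps=1.0`, `min_samples=5`, and so on) are assumptions picked so that the example is easy to read; they are not recommendations for real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Hypothetical toy data: three well-separated 2-D blobs.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [-6, 6]],
                  cluster_std=0.5, random_state=0)

labels = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    "hierarchical": AgglomerativeClustering(n_clusters=3).fit_predict(X),
    "dbscan": DBSCAN(eps=1.0, min_samples=5).fit_predict(X),
    "gmm": GaussianMixture(n_components=3, random_state=0).fit_predict(X),
}

for name, lab in labels.items():
    n_clusters = len(set(lab.tolist()) - {-1})  # DBSCAN marks noise as -1
    n_noise = int(np.sum(lab == -1))
    print(f"{name}: {n_clusters} clusters, {n_noise} noise points")
```

K-means, hierarchical clustering, and the Gaussian mixture all need the number of groups up front, whereas DBSCAN infers it from the density parameters and may leave some points unassigned as noise.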

3. How do you determine the right number of clusters in a dataset?

Determining the right number of clusters in a dataset can be challenging and often depends on the method used for clustering. Some common techniques include the Elbow Method, the Silhouette Method, and the Gap Statistic. These methods evaluate the compactness and separation of the clusters formed for different numbers of clusters to suggest the most appropriate number.
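
A minimal sketch of the Elbow and Silhouette approaches with K-means follows, assuming scikit-learn and the same kind of synthetic three-blob data (both are assumptions for illustration). The Gap Statistic is left out because scikit-learn does not provide it directly.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical toy data: three well-separated 2-D blobs.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [-6, 6]],
                  cluster_std=0.5, random_state=0)

inertias = {}     # within-cluster sum of squares, used for the elbow plot
silhouettes = {}  # mean silhouette coefficient for each k
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X, km.labels_)

# Silhouette Method: pick the k with the highest mean silhouette.
best_k = max(silhouettes, key=silhouettes.get)

for k in sorted(inertias):
    print(f"k={k}: inertia={inertias[k]:.1f}, silhouette={silhouettes[k]:.3f}")
print("best k by silhouette:", best_k)
```

For the Elbow Method, the inertia values are usually plotted against k and the bend is read off by eye. On data this clean, the large drop between k=2 and k=3 followed by only small improvements marks the elbow.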

4. What are the challenges in clustering large datasets?

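Clustering large datasets can be computationally intensive and may require significant memory and processing power. This is especially true of hierarchical clustering, whose standard implementations need time that grows at least quadratically with the number of points and often a full pairwise distance matrix in memory. The choice of algorithm therefore becomes crucial: K-means, and DBSCAN when backed by a spatial index, generally scale better, although DBSCAN can still degrade to quadratic time in the worst case.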

5. How do you validate the results of clustering?

Validating the results of clustering can be done using internal and external validation indices. Internal methods, like the Silhouette Score, measure the quality of clustering based on how compact and well-separated the clusters are. External methods compare the clustering results with a pre-labeled dataset, if available, using metrics like purity or the Rand index. Both validation approaches help in assessing the effectiveness of the clustering process.
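
The indices mentioned above can be computed along these lines. This is a sketch under stated assumptions: the data are synthetic blobs whose true labels are known, `rand_score` requires a reasonably recent scikit-learn release, and purity is computed by hand from the contingency table because scikit-learn has no dedicated purity function.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import rand_score, silhouette_score
from sklearn.metrics.cluster import contingency_matrix

# Hypothetical toy data with known ground-truth labels.
X, y_true = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [-6, 6]],
                       cluster_std=0.5, random_state=0)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Internal validation: uses only the data and the predicted labels.
sil = silhouette_score(X, y_pred)

# External validation: compares predicted labels with the known ones.
ri = rand_score(y_true, y_pred)
table = contingency_matrix(y_true, y_pred)  # rows: true classes, cols: clusters
purity = table.max(axis=0).sum() / table.sum()

print(f"silhouette={sil:.3f}, rand index={ri:.3f}, purity={purity:.3f}")
```

Both external indices ignore how the clusters happen to be numbered, so a clustering that recovers the true groups under different label values still scores 1.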
