Undergrad Looking for advice in clusterization

  • Thread starter: Frank Einstein
  • Tags: Time series
SUMMARY

Frank seeks advice on identifying redundant sensors from a dataset of 1000 rows in an Excel file, using DBSCAN in Python for clustering. He applies StandardScaler for normalization and DBSCAN with parameters eps=15 and min_samples=2. A suggestion is made to utilize a correlation matrix to identify highly correlated variables, allowing for the potential removal of redundant sensors.

PREREQUISITES
  • Understanding of Python programming and libraries such as scikit-learn
  • Familiarity with DBSCAN clustering algorithm
  • Knowledge of data normalization techniques, specifically StandardScaler
  • Ability to compute and interpret correlation matrices
NEXT STEPS
  • Research the implementation of correlation matrices in Python using pandas
  • Explore alternative clustering algorithms like K-Means or Hierarchical Clustering
  • Learn about feature selection techniques to identify and eliminate redundant variables
  • Investigate dimensionality reduction methods such as PCA (Principal Component Analysis)
USEFUL FOR

Data scientists, machine learning practitioners, and analysts working with sensor data or looking to optimize feature selection in datasets.

Frank Einstein
TL;DR
I need to know how to cluster data measured at different time instants.
Hello everyone. I have a machine with a series of sensors. All sensors send a signal each minute. I want to know if any of those sensors are redundant. The data is available as an Excel file, where the columns are the variables and the rows are the measurements. I have 1000 rows.

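To do this, I have used DBSCAN in Python as follows:

[CODE lang="python" title="Data clusterization"]from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# data: rows are measurements, columns are sensors
scaler = StandardScaler()
data_normalized = scaler.fit_transform(data)
# Transpose so that each sensor (column) becomes one sample to cluster
data_normalized = data_normalized.T
dbscan = DBSCAN(eps=15, min_samples=2)
clusters = dbscan.fit_predict(data_normalized)[/CODE]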

However, I think there must be a better way to find relationships between the variables (each sensor, i.e. each column of the data file).

Could someone please point me towards a methodology more suitable for my goals?
Any answer is appreciated.
Thanks for reading.
Best regards.
Frank.
 
You can just look at the correlation matrix. If two inputs are highly correlated then you can probably drop one.
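For anyone who wants to try this, here is a minimal sketch of the correlation-matrix check using pandas. The helper name `correlated_pairs`, the 0.95 threshold, the sensor names and the synthetic data are all assumptions made for illustration; none of them come from the thread. With the real file one would presumably load the data with something like `pd.read_excel("sensors.xlsx")`, where the file name is hypothetical.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.95):
    """Return (column_a, column_b, r) for every pair of columns with |r| >= threshold.

    The threshold is an arbitrary illustrative choice, not a recommended value.
    """
    corr = df.corr()  # Pearson correlation between columns
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # visit each unordered pair once
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, float(r)))
    return pairs


# Synthetic stand-in for the sensor table: 1000 rows, one column per sensor.
# With the real data this might be: demo = pd.read_excel("sensors.xlsx")  (hypothetical file name)
rng = np.random.default_rng(0)
n = 1000
base = rng.normal(size=n)
demo = pd.DataFrame({
    "sensor_a": base,
    "sensor_b": 2.0 * base + 0.01 * rng.normal(size=n),  # nearly a rescaled copy of sensor_a
    "sensor_c": rng.normal(size=n),                      # independent of the others
})

pairs = correlated_pairs(demo)
for a, b, r in pairs:
    print(f"{a} and {b}: r = {r:.3f}")
```

Each reported pair is a candidate for dropping one of the two sensors. Note that Pearson correlation only captures linear relationships, so a sensor that is a nonlinear function of another could go undetected.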
 
Dale said:
You can just look at the correlation matrix. If two inputs are highly correlated then you can probably drop one.
Thanks. I can calculate them with ease as well.
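As a side note, the original DBSCAN setup and the correlation matrix are closely related. For two columns standardized to zero mean and unit population variance (which is what StandardScaler does), the squared Euclidean distance between them over n rows is 2n(1 - r), where r is their Pearson correlation. The sketch below checks this numerically on synthetic data and converts an eps value into the correlation it implies. The data and the helper name `eps_to_min_correlation` are assumptions for illustration, and plain NumPy is used for the standardization instead of scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)
# Three synthetic "sensors": x, a noisy copy of x, and an unrelated signal.
data = np.column_stack([x, x + 0.3 * rng.normal(size=n), rng.normal(size=n)])

# Standardize each column with the population standard deviation (ddof=0).
z = (data - data.mean(axis=0)) / data.std(axis=0)

# Pearson correlation matrix of the columns.
r = np.corrcoef(data, rowvar=False)

# Squared Euclidean distance between every pair of standardized columns.
cols = z.T
diff = cols[:, None, :] - cols[None, :, :]
dist_sq = (diff ** 2).sum(axis=-1)

# Identity: ||z_i - z_j||^2 = 2 * n * (1 - r_ij)
identity_holds = bool(np.allclose(dist_sq, 2 * n * (1 - r)))


def eps_to_min_correlation(eps, n_rows):
    """Correlation at which two standardized columns are exactly eps apart."""
    return 1 - eps ** 2 / (2 * n_rows)


implied_r = eps_to_min_correlation(15, 1000)
print(identity_holds, implied_r)
```

Under these assumptions, eps=15 with 1000 rows links two sensors when their correlation is at least 0.8875. A strongly negative correlation gives a large distance, so the distance-based approach would miss a sensor that is an inverted copy of another, whereas looking at the absolute value of the correlation would catch it.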
 