# Determining the Importance of Certain Data Types (PCA?)

Hello Forum,

This is my first post.

I'm doing a project that extracts certain features from music files. These "features" may become the inputs to a neural network. I have 12 features in total, which will correspond to a maximum of 12 inputs to the neural network.

Essentially I will have 12 columns of data, one column for each feature. For example, 10 music files will produce 10 rows of data, with one value per feature/column in each row. Amplitude could be column 1, for instance.

Anyway, here comes my maths question. I am not an expert at maths, as I've only done basic maths at university, but I'm willing to learn and am a fast learner.

--------------------
I want to decide which input features/columns of data are the most important, and find any relationships between them. Maybe some sort of classification as well, but I'm not sure?

I have been told that PCA, or Principal Component Analysis, could be the best way of doing this. I don't have any knowledge of it, but a search in Google tells me that it involves working out the SD (standard deviation) and other parameters.
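As a rough illustration of what PCA computes, here is a minimal sketch. It assumes NumPy, uses randomly generated stand-in data (no real feature values appear in this thread), and the function name `pca_explained_variance` is made up for the example. It standardises each column using its mean and SD, then reports what fraction of the total variance each principal component carries.

```python
import numpy as np

def pca_explained_variance(X):
    """Return (explained_variance_ratio, components) for a data matrix X.

    X has one row per music file and one column per feature.
    """
    # Standardise each column: subtract its mean, divide by its SD.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardised data; the rows of Vt are the principal directions.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    variances = s**2 / (X.shape[0] - 1)
    return variances / variances.sum(), Vt

# Stand-in data: 50 files x 12 features (hypothetical, random values).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 12))
# Make column 1 a near copy of column 0 to mimic two redundant features.
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=50)

ratio, components = pca_explained_variance(X)
print(np.round(ratio, 3))
```

The near-duplicate column shows up as a trailing component with almost no variance. Note that PCA only looks at the inputs: it reveals redundancy among the features, but it does not say which features predict a target.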

Also, I have been told that classifiers such as Bayesian classifiers could be worth a look.

I'm just looking for advice from the good maths experts on here. How would you tackle the problem, and what techniques would you use? Is it important to look at the relationships between the input data sets?


Dale
Mentor
2020 Award

When you say "most important" what do you mean? In other words, what is the question you are trying to answer or the task you are trying to accomplish with your data?

> When you say "most important" what do you mean? In other words, what is the question you are trying to answer or the task you are trying to accomplish with your data?

Thanks.

The prediction out of the neural network will be the quality of the music. The training data for the neural network will be human scores for certain audio files, i.e. people grade the quality of the audio files and give each one a score. The NN will try to predict the score a human would give.

The inputs will come from the same files that the humans graded. For training, the target outputs of the neural network will be the scores recorded from the humans.

I want to know 3 things:

1. How do I assess which inputs are most important in giving an accurate prediction of the human scores? I have the inputs and the expected outputs of the neural network, so how do I analyse the inputs to see which ones are most important?

2. Also, which inputs should be removed because they have no importance?

3. Are there any other ways of improving the accuracy of the system, e.g. classifiers that will classify some of the inputs in some way? I'm not sure about this. Maybe I could have a different neural network for each class. I think I read that a naive Bayesian classifier can independently decide which inputs to use?
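One simple first look at questions 1 and 2 is the correlation between each input column and the human scores. Below is a minimal sketch in plain Python; the feature names and all the numbers are hypothetical, chosen only to show the mechanics. Correlation captures linear, one-at-a-time relationships only, so a low value does not prove an input is useless to a neural network.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: three features measured on six files, plus human scores.
features = {
    "amplitude": [0.2, 0.4, 0.5, 0.7, 0.8, 0.9],
    "tempo":     [120, 90, 140, 100, 130, 110],
    "noise":     [0.9, 0.7, 0.6, 0.4, 0.3, 0.1],
}
scores = [1.0, 2.0, 2.5, 3.5, 4.0, 5.0]

# Rank features by the absolute value of their correlation with the scores.
ranking = sorted(
    features,
    key=lambda name: abs(pearson(features[name], scores)),
    reverse=True,
)
for name in ranking:
    print(name, round(pearson(features[name], scores), 3))
```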

Thanks for any help.

Dale
Mentor
2020 Award
It sounds to me like you want a multiple regression. That will give you the best linear combination of your features for predicting the scores. You should probably try both a linear regression and a logistic regression.
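A multiple linear regression of this kind can be sketched as follows. This is only an illustration under stated assumptions: it uses NumPy, and the data are synthetic, with the scores deliberately built from features 0 and 4 so that the result can be checked. Standardising the features first makes the fitted weights roughly comparable in size.

```python
import numpy as np

# Hypothetical stand-in data: 40 files, 12 features, scores driven by two of them.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 12))
scores = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 4] + 0.1 * rng.normal(size=40)

# Standardise the features so coefficient magnitudes are comparable.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Add a column of ones for the intercept and solve the least-squares problem.
A = np.column_stack([np.ones(len(Z)), Z])
coef, *_ = np.linalg.lstsq(A, scores, rcond=None)

intercept, weights = coef[0], coef[1:]
order = np.argsort(-np.abs(weights))  # feature indices, largest |weight| first
print("intercept:", round(float(intercept), 3))
print("features by |weight|:", order.tolist())
```

With real data the weights should be read with care: strongly correlated features can share or swap weight between them, so a small coefficient is not by itself proof that a feature is unimportant.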

There are also specific methods for including or excluding your features as predictors.
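One such family of methods is stepwise selection. The sketch below shows a greedy forward version, again assuming NumPy and synthetic data. The function names and the `min_gain` stopping threshold are arbitrary choices made for illustration; in practice a criterion such as an F-test, AIC, or cross-validated error is more commonly used to decide when to stop.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares for a least-squares fit of y on X[:, cols] plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ coef) ** 2))

def forward_select(X, y, min_gain=0.1):
    """Greedily add the feature that lowers the RSS most.

    Stops when the best candidate reduces the RSS by less than
    min_gain (a fraction of the current RSS; an arbitrary threshold).
    """
    chosen, remaining = [], list(range(X.shape[1]))
    best = rss(X, y, chosen)
    while remaining:
        trial = {c: rss(X, y, chosen + [c]) for c in remaining}
        c = min(trial, key=trial.get)
        if best - trial[c] < min_gain * best:
            break
        chosen.append(c)
        remaining.remove(c)
        best = trial[c]
    return chosen

# Hypothetical data: 200 files, 12 features, scores depend on features 0 and 4 only.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 12))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 4] + 0.1 * rng.normal(size=200)

selected = forward_select(X, y)
print("selected feature indices:", selected)
```

Features that never get selected are candidates for removal, although greedy selection can miss features that only help in combination with others.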