Shuffling in classification problems

EngWiPy · Nov 11, 2017

Hello all,

This is not specific to Python; it is a conceptual question about machine learning algorithms, specifically nearest-neighbor classifiers. I have a dataset with m examples, each with n features and one target label. The dataset is originally ordered so that examples with the same label are contiguous: AAA...A, BBB...B, etc. I ran 10-fold cross-validation on the original dataset and got an accuracy of 0.86. Then I shuffled the dataset, ran the same cross-validation procedure, and got an accuracy of 0.32. My question is: is this expected, and if so why, given that I compute the average accuracy over the folds? Is shuffling the data legitimate in the first place?

Thanks in advance
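
One thing worth checking in a setup like the one described is which labels end up in each test fold. The following is a minimal sketch, not the poster's code: it assumes a hypothetical class-sorted label list with 70 examples per class, and the helper name `contiguous_folds` is made up for illustration. It counts the labels in each fold when the folds are contiguous blocks, before and after shuffling.

```python
from collections import Counter
import random


def contiguous_folds(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous blocks (hypothetical helper)."""
    size = n_samples // k
    return [
        list(range(i * size, (i + 1) * size if i < k - 1 else n_samples))
        for i in range(k)
    ]


# Hypothetical class-sorted labels: AAA...A, BBB...B, CCC...C
labels = ["A"] * 70 + ["B"] * 70 + ["C"] * 70
k = 10
folds = contiguous_folds(len(labels), k)

# Label counts per test fold when the data keeps its original order
ordered_counts = [Counter(labels[j] for j in fold) for fold in folds]

# Same folds after a random shuffle (fixed seed so the run is repeatable)
shuffled = labels[:]
random.Random(0).shuffle(shuffled)
shuffled_counts = [Counter(shuffled[j] for j in fold) for fold in folds]

for i, (before, after) in enumerate(zip(ordered_counts, shuffled_counts)):
    print(i, dict(before), dict(after))
```

With sorted data, every contiguous fold in this sketch holds one or two classes only, so the class mix in each test fold differs sharply from the mix in the corresponding training set. After shuffling, each fold is a random sample of the whole dataset.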

EngWiPy · Nov 12, 2017

If this question doesn't fit here, could anyone recommend some good, active forums where I can post questions about machine learning and data analytics? Thanks

jedishrfu · Nov 14, 2017

What are you trying to do, in more concrete terms?

Suppose you were trying to identify cards in a deck, had organized them by card value (aces, deuces, ..., jacks, queens, kings), and trained your ANN to recognize them. Then shuffling would be okay in my opinion, because the ANN is being trained to recognize cards, not sequences of cards.

EngWiPy · Nov 14, 2017

I was trying to identify three wheat classes (Canadian, Kama, and Rosa) based on a set of measurements of the kernel. The original dataset was ordered by class: C, C, ..., C, K, K, ..., K, R, R, ..., R. At first I ran the cross-validation on the original dataset. Then I shuffled it and ran the cross-validation again. After some inspection I discovered I had two problems:

1- The indentation of my code wasn't correct, and Python relies on indentation to delimit blocks of code, such as the body of a for loop.
2- I didn't normalize the measurements to a common scale, e.g. using z-score normalization.
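
For reference, z-score normalization rescales each feature column to zero mean and unit standard deviation, so that no single measurement dominates the distance used by a nearest-neighbor classifier. Below is a minimal sketch using only the standard library; the function name `zscore_columns` and the sample rows are hypothetical and are not taken from the poster's code or data.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Rescale every column of `rows` to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    # Fall back to 1.0 for a constant column to avoid dividing by zero
    sigmas = [pstdev(col) or 1.0 for col in columns]
    return [
        [(x - mu) / sigma for x, mu, sigma in zip(row, mus, sigmas)]
        for row in rows
    ]


# Hypothetical measurements on very different scales (made-up values)
rows = [[15.26, 0.871], [14.88, 0.881], [19.11, 0.888], [12.30, 0.845]]
scaled = zscore_columns(rows)
```

Inside a cross-validation loop, the mean and standard deviation would usually be computed from the training rows only and then applied to the test rows, so that the test fold does not influence the scaling.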

After correcting the two mistakes above, my code now gives a 10-fold cross-validation accuracy of 0.86 without shuffling, 0.96 with random shuffling, and 0.94 when I select the test block for each fold from the original (unshuffled) data as

Code:

for i in range(fold):
    ...
    testing = data[i::fold]  # every fold-th row, starting at row i
    ...
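
The slice `data[i::fold]` takes every `fold`-th row starting at row `i`, so on class-sorted data each test block draws rows from across the whole dataset without any shuffling. Here is a minimal, self-contained sketch of how such a loop might look with a 1-nearest-neighbor classifier. The elided parts of the original loop are not known, so everything other than the `testing = data[i::fold]` line is an assumption, including the function names and the toy data.

```python
import math


def nearest_neighbor_predict(training, x):
    """Return the label of the training row closest to x (1-NN, Euclidean distance)."""
    return min(training, key=lambda row: math.dist(row[0], x))[1]


def strided_cv_accuracy(data, fold):
    """Average accuracy over `fold` folds, where fold i is tested on data[i::fold]."""
    scores = []
    for i in range(fold):
        testing = data[i::fold]
        # Train on every row that is not in the current test block
        training = [row for j, row in enumerate(data) if j % fold != i]
        correct = sum(nearest_neighbor_predict(training, x) == y for x, y in testing)
        scores.append(correct / len(testing))
    return sum(scores) / fold


# Hypothetical toy data sorted by class (C..., K..., R...), three separated clusters
data = [
    ((float(c * 10 + j % 3), float(c * 10 + j % 2)), label)
    for c, label in enumerate("CKR")
    for j in range(20)
]
accuracy = strided_cv_accuracy(data, fold=10)
print(accuracy)
```

In this toy setup every test block contains rows from all three classes even though the data is never shuffled, which is the property that distinguishes strided selection from taking contiguous blocks.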
