Python Shuffling in classification problems

EngWiPy · Nov 11, 2017

Hello all,

This question is not specific to Python; it is a conceptual one about machine learning algorithms, specifically nearest neighbor classifiers. I have a dataset with m examples, each with n features and one target feature (the class label). The dataset is originally ordered so that examples with the same target are contiguous: AAA...A, BBB...B, and so on. I ran 10-fold cross validation on the original dataset and got an accuracy of 0.86. Then I shuffled the dataset, ran the same cross validation procedure, and got an accuracy of 0.32. My question is: is this drop expected, and if so why, given that I compute the average accuracy over the folds? Is shuffling the data legitimate in the first place?

Thanks in advance
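To make the setup in the question concrete, here is a minimal sketch (not the code used in this thread) of how contiguous 10-fold blocks look on class-sorted versus shuffled data. The label list, the class sizes, and the helper `contiguous_folds` are assumptions made purely for illustration; only the class composition of each fold is shown, not any classifier.

```python
from collections import Counter
import random

# Hypothetical stand-in for a class-sorted dataset: 3 classes, 70 labels each.
labels = ["A"] * 70 + ["B"] * 70 + ["C"] * 70
fold = 10

def contiguous_folds(seq, k):
    """Split seq into k contiguous blocks of equal size (any remainder is dropped)."""
    n = len(seq) // k
    return [seq[i * n:(i + 1) * n] for i in range(k)]

# Folds taken from the data in its original, class-sorted order.
ordered = contiguous_folds(labels, fold)

# Folds taken after shuffling a copy with a fixed seed (for repeatability).
shuffled_labels = labels[:]
random.Random(0).shuffle(shuffled_labels)
shuffled = contiguous_folds(shuffled_labels, fold)

for name, folds in [("ordered", ordered), ("shuffled", shuffled)]:
    print(name)
    for i, f in enumerate(folds):
        print("  fold", i, dict(Counter(f)))
```

On the sorted data most test folds contain a single class, so each fold's accuracy reflects only that class. After shuffling, the folds are mixed, so per-fold accuracy is measured on a blend of classes.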

EngWiPy · Nov 12, 2017

If this question doesn't fit here, could anyone recommend some good, active forums where I can ask questions about machine learning and data analytics? Thanks

jedishrfu · Nov 14, 2017

What are you trying to do, in more concrete terms?

Suppose you were trying to identify cards in a deck, had organized them by card value (aces, deuces, ..., jacks, queens, kings), and trained your ANN to recognize them. Then shuffling would be okay in my opinion, because the ANN is being trained to recognize cards, not sequences of cards.

EngWiPy · Nov 14, 2017

I was trying to identify three wheat classes (Canadian, Kama, and Rosa) based on a set of measurements of the kernels. The original dataset was organized by class: C, C, ..., C, K, K, ..., K, R, R, ..., R. At first I did the cross validation on the original dataset. Then I shuffled it and did the cross validation again. After some inspection I discovered I had two problems:

1- The indentation of my code was wrong, and Python relies on indentation to delimit blocks of code, such as the body of a for loop.
2- I hadn't normalized the measurements to a common scale, e.g. with z-score normalization.
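For reference, z-score normalization can be sketched as follows. This is a minimal illustration, not the code from this thread: the helper name `zscore_columns` and the sample measurements are made up. Each column is shifted to mean 0 and scaled to standard deviation 1, so that no single feature dominates the distances a nearest neighbor classifier computes.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Return rows with each column rescaled to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    # Fall back to 1.0 for a constant column to avoid dividing by zero.
    stds = [pstdev(c) or 1.0 for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical measurements: column 0 is on a much larger scale than column 1.
rows = [[1000.0, 0.1], [1100.0, 0.9], [1200.0, 0.2], [1300.0, 0.8]]
scaled = zscore_columns(rows)

for original, z in zip(rows, scaled):
    print(original, "->", [round(v, 3) for v in z])
```

Inside cross validation it is usually safer to compute the means and standard deviations from the training fold only and apply them to the test fold, so that no information from the test examples leaks into the scaling.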

After correcting these two mistakes, my code now gives a 10-fold cross validation accuracy of 0.86 without shuffling, 0.96 with random shuffling, and 0.94 when I select the test block for each fold from the original (unshuffled) data as

Code:

for i in range(fold):
    ...
    testing = data[i::fold]
    ...
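The strided slice `data[i::fold]` takes every `fold`-th example starting at offset `i`, so each test block draws from the whole class-sorted dataset without any shuffling. A minimal sketch of that behaviour, using a made-up label list in place of the real measurements (the class sizes and the `training` construction are assumptions, since the rest of the loop is elided above):

```python
from collections import Counter

# Hypothetical class-sorted labels standing in for the real dataset.
data = ["C"] * 70 + ["K"] * 70 + ["R"] * 70
fold = 10

folds = []
for i in range(fold):
    # Test block: items at positions i, i + fold, i + 2 * fold, ...
    testing = data[i::fold]
    # Training block: every item whose position is not in the test block.
    training = [x for j, x in enumerate(data) if j % fold != i]
    folds.append((testing, training))
    print(i, len(testing), len(training), dict(Counter(testing)))
```

With equal class sizes that are multiples of `fold`, every test block ends up with the same number of examples from each class, which is similar in effect to stratified sampling.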
