Python Shuffling in classification problems

  • Thread starter: EngWiPy
  • Tags: Classification
AI Thread Summary
The discussion centers on the impact of dataset organization on the performance of nearest neighbor classifiers in machine learning. Initially, the dataset was structured with similar target features grouped together, resulting in a high accuracy of 0.86 during 10-fold cross-validation. However, after shuffling the dataset, the accuracy dropped significantly to 0.32, raising questions about the legitimacy of shuffling and its effect on model training. Upon further investigation, the user identified two critical issues: incorrect code indentation, which is crucial in Python, and the lack of normalization of the feature measurements. After addressing these problems, the accuracy improved to 0.86 without shuffling, 0.96 with shuffling, and 0.94 when selecting test blocks from the original data for each fold. This highlights the importance of proper data handling and preprocessing in machine learning to achieve reliable model performance.
EngWiPy
Hello all,

This is not specific to Python; it is a conceptual question about machine learning algorithms, specifically nearest neighbor classifiers. I have a dataset with m examples, each with n features and one target feature. The dataset is originally ordered so that examples with the same target value are contiguous: AAA...A, BBB...B, and so on. I ran 10-fold cross-validation on the original dataset and got an accuracy of 0.86. Then I shuffled the dataset, repeated the same cross-validation procedure, and got an accuracy of 0.32. My question is: is this expected, and if so why, given that I compute the average accuracy over the folds? Is shuffling the data legitimate in the first place?

Thanks in advance
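
One way to see why the ordering can matter is to look at which classes land in each test fold. The sketch below is only an illustration: the labels, the class sizes (70 each), and the helper `contiguous_folds` are assumptions, not the actual dataset or code from this thread, and it does not by itself explain the accuracies reported here.

```python
from collections import Counter
import random

# Hypothetical stand-in for a class-ordered dataset: 3 classes, 70 examples each.
labels = ["A"] * 70 + ["B"] * 70 + ["C"] * 70

def contiguous_folds(items, k):
    """Split items into k contiguous blocks of (nearly) equal size."""
    n = len(items)
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(k)]

# Folds cut from the data in its original, class-ordered form.
ordered = contiguous_folds(labels, 10)

# Folds cut from a shuffled copy (fixed seed so the run is repeatable).
shuffled_labels = labels[:]
random.Random(0).shuffle(shuffled_labels)
shuffled = contiguous_folds(shuffled_labels, 10)

# Print the class counts of each test fold, ordered vs. shuffled.
for i in range(10):
    print(i, dict(Counter(ordered[i])), dict(Counter(shuffled[i])))
```

On the ordered data most test folds contain a single class, so the matching training set is missing a large share of exactly that class. After shuffling, each fold draws from the whole shuffled list instead of from one contiguous run of a single class.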
 
If this question doesn't fit here, could anyone recommend some good, active forums where I can post questions about machine learning and data analytics? Thanks
 
What are you trying to do, more concretely?

If you were trying to identify cards in a deck, and had organized them by card value (aces, deuces, ..., jacks, queens, kings) and trained your ANN to recognize them, then shuffling would be okay in my opinion, because the ANN is being trained to recognize cards, not sequences of cards.
 
I was trying to identify three wheat classes, Canadian, Kama, and Rosa, based on a set of measurements of the kernel. The original dataset was ordered by class: C, C, ..., C, K, K, ..., K, R, R, ..., R. At first I did the cross-validation using the original dataset. Then I shuffled it and did the cross-validation again. After some inspection I discovered I had two problems:

1- The indentation of my code wasn't correct, and Python relies on indentation to delimit blocks of code, such as the body of a for loop.
2- I didn't normalize the measurements to a common scale, e.g. using z-score normalization.
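
For the second point, a minimal sketch of z-score normalization using only the standard library could look like this. The function name `zscore_columns` and the sample numbers are made up for illustration; they are not the thread's code or data.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Return rows with each column rescaled to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # population standard deviation
    return [
        # A constant column (std == 0) is mapped to 0.0 to avoid dividing by zero.
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Made-up measurements on very different scales (not the thread's data).
rows = [
    [15.0, 0.87],
    [14.0, 0.88],
    [19.0, 0.90],
    [12.0, 0.85],
]
scaled = zscore_columns(rows)
for row in scaled:
    print(row)
```

Without a step like this, a distance-based classifier such as nearest neighbor tends to be dominated by whichever feature has the largest numeric range. Within cross-validation, it is generally safer to compute the means and standard deviations from the training fold only and then apply them to the test fold.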

After correcting these two mistakes, my code now gives a 10-fold cross-validation accuracy of 0.86 without shuffling, 0.96 with random shuffling, and 0.94 when I select the test block for each fold from the original data as

Code:
for i in range(fold):
...
    testing = data[i::fold]  ...
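
To make the strided selection concrete, here is a small self-contained sketch of cross-validation where fold i tests on `data[i::fold]` and trains on everything else. The parts elided in the snippet above are not reconstructed; the 1-nearest-neighbor rule, the helper names, and the toy data are all assumptions chosen for illustration.

```python
import math

def nearest_neighbor_predict(train, point):
    """Return the label of the training example closest to point (1-NN, Euclidean)."""
    _, label = min(train, key=lambda ex: math.dist(ex[0], point))
    return label

def strided_cv_accuracy(data, fold):
    """Average accuracy over `fold` folds, where fold i tests on data[i::fold]."""
    scores = []
    for i in range(fold):
        testing = data[i::fold]
        # Everything not in the test block is used for training.
        training = [ex for j, ex in enumerate(data) if j % fold != i]
        correct = sum(nearest_neighbor_predict(training, x) == y for x, y in testing)
        # Note the indentation: the score is appended once per fold, inside the loop.
        scores.append(correct / len(testing))
    return sum(scores) / len(scores)

# Toy class-ordered data (hypothetical): three well-separated clusters on a line,
# 20 examples per class, stored as (features, label) pairs.
data = [((c * 10.0 + k * 0.1,), label)
        for c, label in enumerate("CKR") for k in range(20)]

print(strided_cv_accuracy(data, 10))
```

Because `data[i::fold]` takes every `fold`-th example starting at index i, each test block samples across the whole class-ordered file, so with equal class sizes every class appears in both the test block and the training set without any random shuffling.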
 