Shuffling in classification problems

  • Context: Python 
  • Thread starter: EngWiPy
  • Tags: Classification

Discussion Overview

The discussion revolves around the impact of shuffling a dataset on the performance of nearest neighbor classifiers in machine learning, particularly in the context of cross-validation accuracy. Participants explore whether shuffling is a legitimate practice and how it affects model performance.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant questions the legitimacy of shuffling the dataset, noting a significant drop in accuracy from 0.86 to 0.32 after shuffling.
  • Another participant suggests that shuffling may be acceptable depending on the specific task, using the example of recognizing cards in a deck, where the order of cards does not affect recognition.
  • A later reply describes a specific case involving three wheat classes, highlighting issues with code indentation and normalization that affected the results. After corrections, the participant reports improved accuracy with shuffling (0.96) compared to the original order (0.86).

Areas of Agreement / Disagreement

Participants express differing views on the appropriateness of shuffling datasets. While some suggest it may be acceptable in certain contexts, others raise concerns about its impact on model performance, indicating that the discussion remains unresolved.

Contextual Notes

Limitations include potential dependencies on dataset organization, the importance of normalization, and the sensitivity of code execution to formatting errors. These factors may influence the results but are not fully explored in the discussion.

Who May Find This Useful

Readers interested in machine learning, particularly those working with classification algorithms and dataset preparation techniques, may find this discussion relevant.

EngWiPy
Hello all,

This is not specific to Python; it is a conceptual question about machine learning algorithms, specifically nearest neighbor classifiers. I have a dataset with m examples, each with n features and one target feature. The dataset is originally ordered so that examples with the same target value are contiguous: AAA...A, BBB...B, and so on. I ran 10-fold cross-validation on the original dataset and got an accuracy of 0.86. Then I shuffled the dataset, repeated the same cross-validation procedure, and got an accuracy of 0.32. Is this expected, and if so, why, given that I compute the average accuracy across folds? Is shuffling the data legitimate in the first place?

Thanks in advance
 
If this question doesn't fit here, could anyone recommend some good, active forums where I can ask questions about machine learning and data analytics? Thanks.
 
What are you trying to do, more concretely?

Suppose you were trying to identify cards in a deck and had organized them by card value (aces, deuces, ..., jacks, queens, kings) before training your ANN to recognize them. In that case shuffling would be okay, in my opinion, because the ANN is being trained to recognize cards, not sequences of cards.
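As an illustration of what "shuffling" has to mean for this to be safe, here is a minimal sketch, assuming the features and labels are held in two parallel lists. The function name `shuffle_together` and the toy data are hypothetical, not taken from the thread. The point is only that each example must keep its own label when the order changes.

```python
import random

def shuffle_together(features, labels, seed=0):
    """Return shuffled copies of features and labels, keeping each row paired with its label."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    order = list(range(len(features)))
    # Shuffle the index list once, then apply the same order to both lists.
    random.Random(seed).shuffle(order)
    return [features[i] for i in order], [labels[i] for i in order]

# Hypothetical toy data, sorted by class like the dataset described above.
features = [[1.0], [1.1], [5.0], [5.2], [9.0], [9.3]]
labels = ["A", "A", "B", "B", "C", "C"]

shuffled_x, shuffled_y = shuffle_together(features, labels)
```

Shuffling the features and the labels independently would break the pairing and would be expected to push accuracy toward chance level.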
 
I was trying to identify three wheat classes (Canadian, Kama, and Rosa) based on a set of measurements of the kernel. The original dataset was ordered by class: C, C, ..., C, K, K, ..., K, R, R, ..., R. At first I ran the cross-validation on the original dataset, then I shuffled it and ran the cross-validation again. After some inspection I discovered two problems:

1- The indentation of my code wasn't correct; Python uses indentation to delimit blocks of code, such as the body of a for loop.
2- I didn't normalize the measurements to a common scale, e.g. using z-score normalization.
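For reference, z-score normalization could look roughly like the sketch below. This is not the poster's code: the function name `zscore_columns` and the sample rows are hypothetical, and it assumes the measurements are stored as a list of numeric rows. Each column is rescaled to zero mean and unit standard deviation, so that no single measurement dominates the distance used by a nearest neighbor classifier.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Rescale every column of a list of numeric rows to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # population standard deviation
    return [
        # A constant column has zero spread, so map it to 0.0 instead of dividing by zero.
        [(value - m) / s if s > 0 else 0.0 for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical measurements on very different scales.
rows = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
normalized = zscore_columns(rows)
```

In a cross-validation setting, a common choice is to compute the means and standard deviations from the training portion of each fold only and then apply them to the test portion.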

After correcting the two mistakes above, my code now gives a 10-fold cross-validation accuracy of 0.86 without shuffling, 0.96 with random shuffling, and 0.94 when I select the test block for each fold from the original data as follows:

Code:
for i in range(fold):
...
    testing = data[i::fold]  ...
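To see why the way the test block is chosen can matter on class-sorted data, here is a small sketch that only counts class labels per fold. It is an illustration under assumptions, not the code from this thread: the label list, the fold count of 5, and the helper names `contiguous_fold` and `strided_fold` are all hypothetical. It compares taking contiguous blocks with taking every `fold`-th element, as in `data[i::fold]`.

```python
from collections import Counter

# Hypothetical class-sorted labels: 10 examples of each of three classes.
labels = ["C"] * 10 + ["K"] * 10 + ["R"] * 10
fold = 5

def contiguous_fold(items, i, k):
    """Test block i of k, taken as one contiguous slice (assumes len(items) divisible by k)."""
    size = len(items) // k
    return items[i * size:(i + 1) * size]

def strided_fold(items, i, k):
    """Test block i of k, taken as every k-th item starting at index i."""
    return items[i::k]

contiguous_counts = [Counter(contiguous_fold(labels, i, fold)) for i in range(fold)]
strided_counts = [Counter(strided_fold(labels, i, fold)) for i in range(fold)]

for i in range(fold):
    print(i, dict(contiguous_counts[i]), dict(strided_counts[i]))
```

With this toy data, the contiguous test blocks are dominated by one or two classes, while each strided block contains all three classes in equal numbers. That may be part of the reason the unshuffled result differs from the shuffled and strided ones, although the thread does not confirm it.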
 
