Shuffling in classification problems

In summary: Accuracy = (number of test examples whose predicted class equals the target class) / (total number of test examples)
  • #1
EngWiPy
Hello all,

This is not particular to Python; it is a conceptual question about machine learning algorithms, specifically nearest neighbor classifiers. I have a dataset with m examples, each with n features and one target feature. The dataset is originally ordered so that examples with the same target are contiguous: AAA...A, BBB...B, and so on. I did 10-fold cross validation on the original dataset and got an accuracy of 0.86. Then I shuffled the dataset and ran the same cross validation procedure, but got an accuracy of 0.32. My question is: is this expected, and why, given that I compute the average accuracy over the folds? Is shuffling the data legitimate in the first place?

Thanks in advance
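For reference, the averaged accuracy described in the question can be sketched as below. This is only an illustration: the function name `accuracy` and the tiny label lists are made up for the example and are not the poster's code or data.

```python
import numpy as np

def accuracy(predicted, target):
    """Fraction of test examples whose predicted label equals the target label."""
    predicted = np.asarray(predicted)
    target = np.asarray(target)
    return float(np.mean(predicted == target))

# Hypothetical predictions for two small folds (not data from the thread).
fold_1 = accuracy(["A", "A", "B", "B"], ["A", "B", "B", "B"])  # 3 of 4 correct
fold_2 = accuracy(["C", "C", "A", "B"], ["C", "C", "A", "A"])  # 3 of 4 correct

# Cross validation reports the mean of the per-fold accuracies.
mean_accuracy = float(np.mean([fold_1, fold_2]))
print(fold_1, fold_2, mean_accuracy)
```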
 
  • #2
If this question doesn't fit, could anyone recommend me some good and active forums where I can pose my questions related to machine learning and data analytics? Thanks
 
  • #3
What are you trying to do in a more concrete way?

If you were trying to identify cards in a deck and had, say, organized them by card value (aces, deuces, ..., jacks, queens, kings) and trained your ANN to recognize them, then shuffling would be okay in my opinion, because the ANN is being trained to recognize cards and not sequences of cards.
 
  • #4
I was trying to identify three wheat classes (Canadian, Kama, and Rosa) based on a set of measurements of the kernel. The original dataset was ordered by class: C, C, ..., C, K, K, ..., K, R, R, ..., R. At first I did the cross validation using the original dataset. Then I shuffled it and did the cross validation again. After some inspection I discovered I had two problems:

1- The indentation of my code wasn't correct, and Python relies on indentation to identify blocks of code, such as the body of a for loop.
2- I didn't normalize the measurements to a common scale, e.g. using z-score normalization.
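Point 2 matters for nearest neighbor classifiers because a distance such as the Euclidean distance is dominated by whichever feature has the largest numeric range. A minimal sketch of z-score normalization follows, assuming NumPy arrays with one row per example. The helper name `zscore` and the numbers are hypothetical, and taking the mean and standard deviation from the training rows only is a choice made for this sketch, not something stated in the thread.

```python
import numpy as np

def zscore(train, test):
    """Standardize columns using the mean and std of the training rows only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return (train - mean) / std, (test - mean) / std

# Hypothetical measurements: column 0 is in the tens, column 1 is below one.
train = np.array([[15.0, 0.87], [14.0, 0.88], [19.0, 0.91], [12.0, 0.85]])
test = np.array([[16.0, 0.89]])

train_z, test_z = zscore(train, test)
print(train_z.mean(axis=0))  # approximately 0 for each column
print(train_z.std(axis=0))   # approximately 1 for each column
```

After this step both columns contribute on a comparable scale to the distance between two examples.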

After correcting the above two mistakes, my code now gives me a 10-fold cross validation accuracy of 0.86 without shuffling, 0.96 with random shuffling, and 0.94 when I select the test block for each fold from the original (unshuffled) data as

Code:
for i in range(fold):
    ...
    # Test block: every fold-th row of the original data, starting at row i.
    testing = data[i::fold]
    ...
 

1. What is shuffling in classification problems?

Shuffling in classification problems refers to randomly reordering the examples in a dataset. It is commonly done before splitting the data into training and test sets (or cross validation folds), so that the original order of the data does not determine which examples land in which split.

2. Why is shuffling important in classification problems?

Shuffling is important when the data is stored in a meaningful order, for example sorted by class as in this thread. If such data is cut into contiguous folds, a test fold may contain only one class and the training set may under-represent it, so the measured accuracy reflects the ordering rather than the classifier. After shuffling, each fold has roughly the same class mixture as the whole dataset.

3. When should shuffling be done in a classification problem?

Shuffling should be done before the data is split into training and test sets or cross validation folds, and therefore before training. The features and the target labels must be reordered with the same permutation so that each example keeps its own label.
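A minimal sketch of shuffling before a split, assuming the features and labels are held in two NumPy arrays. The arrays `X` and `y` and the seed value are hypothetical. The point is that a single permutation indexes both arrays, so rows and labels stay paired.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed so the split is reproducible

# Hypothetical data: 6 rows, 2 features, labels sorted by class.
X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array(["A", "A", "B", "B", "C", "C"])

perm = rng.permutation(len(y))  # one permutation for both arrays
X_shuffled, y_shuffled = X[perm], y[perm]

print(perm)
print(y_shuffled)
```

Shuffling `X` and `y` independently would silently scramble the labels, which by itself is enough to push the measured accuracy toward chance level.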

4. Can shuffling improve the accuracy of a classification model?

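Shuffling can change the measured accuracy, but mainly because it makes the training and test splits representative, not because it changes the classifier itself. In this thread the 10-fold accuracy was 0.86 on the class-sorted data and 0.96 after random shuffling. For models trained incrementally in mini-batches, shuffling can also help training, since each batch then contains a mix of classes.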

5. Are there any downsides to shuffling in classification problems?

One downside is that shuffling is not appropriate when the order of the data carries information, such as time series or grouped measurements, because random splits can then let information from the test examples leak into the training set. Results also depend on the particular permutation, so a fixed random seed is needed to reproduce them. The computational cost of shuffling is usually small compared with training.
