Data transformations: How do you know when to stop?

In summary, the results of the experiment were not statistically significant, but the PI wants to keep transforming the data to see if that changes anything. However, the more the data is transformed, the further it strays from what the raw data truly represents.
  • #1
ProfuselyQuarky
I'm analyzing raw data and, although the trends look promising visually, none of it is statistically significant. I was going to leave it at that, since the data came from only one year of the experiment, and simply note that if treatment continued for a longer period there might be more significant results compared to control. But my PI told me to transform the data to see if that changes anything. I've never really done that before, so I'm working on it, but I want to know what the general logic is for deciding when to keep transforming your data versus just accepting the non-significance and moving on. I.e., I'm slowly forcing the data to be more normal because the variances are unequal, but the more I do so, the more it strays from what the raw data truly represents, no? I'm not a programmer or data analyst; this is all foreign to me.
 
  • #2
Normally, transforming data means fixing up units of measure for consistency, dropping clear outliers, and dropping data from botched experiments, if any.

What you shouldn't do is drop data just because it disagrees with what you're trying to prove, a.k.a. cherry-picking the data.

Some researchers, knowing that journals won't publish negative results, skip reporting those experiments and publish only their positive results. This frustrates other researchers who try to replicate what the first group did, have a poor success rate, and then discover that the first group had the same problem but didn't include it in their paper.

What is it that your PI wants you to do exactly?
 
  • #3
jedishrfu said:
What is it that your PI wants you to do exactly?
Basically, two of the four indices I tested were extremely left-skewed relative to a normal distribution. Specifically, they want me to force the data into a proper distribution so that the variance assumption behind ANOVA isn't totally violated and we can allegedly get better p-values. There are certainly a few outliers, since the sample size is pretty large, but I just wasn't sure what the threshold was for accepting that we weren't going to get the desired numbers (or, maybe not the best numbers, but the "better" numbers).

And yeah, I'm always afraid of being dishonest with results, so I make a conscious effort not to "cherry-pick". It might be worth noting that I was sort of coerced into this project with this different PI, and there's been a lot of pressure to produce publish-worthy results before I leave the university at the end of August. I don't think my data is bad, though. It's just not as magical as they might have wanted.
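For context, the kind of thing I'm attempting looks roughly like the sketch below: a power transform to reduce the skew, then re-checking the equal-variance assumption before running ANOVA. The arrays and group names are made up for illustration, not my actual indices.

```python
# Minimal sketch: variance-stabilizing transform + assumption checks with SciPy.
# The data below is synthetic; substitute the real control/treatment indices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.lognormal(mean=0.0, sigma=0.8, size=50)    # skewed stand-in data
treatment = rng.lognormal(mean=0.2, sigma=1.2, size=50)

# Equal-variance check on the raw data
print("Levene (raw):", stats.levene(control, treatment))

# Box-Cox needs strictly positive values; it estimates a power transform
control_t, lam = stats.boxcox(control)
treatment_t = stats.boxcox(treatment, lmbda=lam)  # reuse lambda so groups stay comparable

# Re-check the assumption and run the ANOVA on the transformed values
print("Levene (transformed):", stats.levene(control_t, treatment_t))
print("ANOVA (transformed):", stats.f_oneway(control_t, treatment_t))
```

If the variance check still fails after the transform, that's probably a sign the transformation isn't the right fix rather than a reason to keep piling on more of them.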
 
  • #4
My niece was in a similar fix: her PI was being promoted to department head, so her publication was delayed until that happened. Her work basically obliterated three earlier papers by showing that the effect they were looking for just didn't exist. She was so mad about it, as she wanted to contribute in a positive way, not in a way where prior papers had to be retracted because of her work.

Her experiment used two genetically altered mice carrying the gene of interest. The first hurdle was that one mouse turned out not to have the altered gene, so the data got skewed. The second hurdle was that the expected effect wasn't seen, hence her negative results. But hey, the PI got his promotion and the earlier papers were retracted, but not to his detriment.
 
  • #5
jedishrfu said:
But hey, the PI got his promotion and the earlier papers were retracted, but not to his detriment.
Yeah, this has been an issue before too... Basically, a Michaelis constant was obtained and published, but the authors' conclusions relied on conveniently equating Km to their half-velocity constant. Since that only works when the rate of change of the enzyme-substrate complex is very, very small (which it wasn't), all the numbers were wrong and everyone was so mad. The journal refused to retract. It really makes me question peer review.

I digress, however. I appreciate the advice. I'm going to email him and tell him that nothing I do in Python is going to turn the data into what he wants it to be. Thank you!
 
  • #6
Have you assessed whether a non-parametric test that doesn't make the assumption of normality is a better choice?
 
  • #7
Jarvis323 said:
Have you assessed whether a non-parametric test that doesn't make the assumption of normality is a better choice?
I did do a principal component analysis (whilst excluding outliers), which to my understanding doesn't have any Gaussian distribution assumption... this also indicated a promising trend, but nothing shocking.
 
  • #8
ProfuselyQuarky said:
I did do a principal component analysis (whilst excluding outliers), which to my understanding doesn't have any Gaussian distribution assumption... this also indicated a promising trend, but nothing shocking.
I mean a non-parametric null hypothesis test.
https://www.nature.com/articles/nmeth.2937
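For instance, a minimal sketch of what that could look like in SciPy, using a rank-based two-sample test and its multi-group counterpart (the arrays here are placeholders, not the thread's data):

```python
# Hedged sketch: non-parametric alternatives to the t-test/ANOVA in SciPy.
# These rank-based tests make no normality assumption about the data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.exponential(scale=1.0, size=60)    # skewed, synthetic stand-ins
treatment = rng.exponential(scale=1.4, size=60)

# Two groups: Mann-Whitney U test (Wilcoxon rank-sum)
print(stats.mannwhitneyu(control, treatment, alternative="two-sided"))

# Three or more groups: Kruskal-Wallis H-test
third_group = rng.exponential(scale=1.2, size=60)
print(stats.kruskal(control, treatment, third_group))
```

Which test is appropriate depends on the design (paired vs. independent samples, number of groups), so this only shows the general shape of the call.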
 
  • #9
Jarvis323 said:
I mean a non-parametric null hypothesis test.
https://www.nature.com/articles/nmeth.2937
Oh, thanks for this. I'll download the paper and look through it for sure. Again, I'm in no way a pro data analyst and am still trying to learn.
 
  • #10
Applying a normalization function to your data doesn't make your result less significant. The numerical values attached to your data don't have any special status over any other way you could have chosen to write the numbers down. As long as you are doing a computation that can be applied to future data, there's no inherent problem.

That said, the more times you go through the cycle of manipulate the data, run a test, manipulate the data again, repeat, the less meaningful your final result is. You will eventually luck into a good p-value, and that doesn't mean anything.
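A toy simulation of that last point (entirely synthetic data, unrelated to the thread's experiment): if you try several transformations of the same null data and keep whichever gives the smallest p-value, the false-positive rate climbs above the nominal 5%.

```python
# Toy demonstration: cherry-picking the transformation inflates false positives.
# Both groups are drawn from the SAME distribution, so any "significance" is spurious.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
transforms = [lambda x: x, np.log, np.sqrt, lambda x: 1.0 / x, lambda x: x ** 2]

trials, hits = 2000, 0
for _ in range(trials):
    a = rng.lognormal(size=30)
    b = rng.lognormal(size=30)
    # Keep the best p-value across all candidate transforms
    best_p = min(stats.ttest_ind(f(a), f(b)).pvalue for f in transforms)
    if best_p < 0.05:
        hits += 1

print(f"False-positive rate when picking the best transform: {hits / trials:.3f}")
```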
 
  • #11
While you want to avoid false positive results, you should also try to avoid false negative results; both can have a detrimental impact on scientific progress. So you should also avoid reporting that the results are not statistically significant if some kind of transformation or alternative method is actually needed to detect the effect, depending on the characteristics of the data. However, you shouldn't just blindly try everything until you get a positive result, as Office_Shredder pointed out. You should analyze the data, learn the necessary principles of data science and statistics and the approaches of your peers, and get feedback from others, in order to settle on a processing pipeline and methodology that is justified by your data's characteristics, and in the paper you should explain that justification clearly. If the justified approach doesn't yield positive results, then you stop there, I think.
 

What is the purpose of data transformations?

Data transformations are used to manipulate and organize data in a way that makes it more useful for analysis or other purposes. This can include converting data types, cleaning up messy data, or rearranging data for easier interpretation.

When should data transformations be performed?

Data transformations should be performed after data collection and before any analysis or modeling takes place. This ensures that the data is in a usable format and can lead to more accurate and meaningful results.

How do you know when to stop performing data transformations?

The decision to stop performing data transformations is largely based on the quality and usability of the data. Generally, you should stop when the data is in a format that is suitable for the specific analysis or purpose at hand.

What are some common techniques used for data transformations?

Some common techniques for data transformations include data normalization, aggregation, filtering, and imputation. These techniques can help to standardize and clean up data, making it easier to work with and analyze.
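As a rough illustration (the DataFrame and column names below are invented for the example), these techniques might look like this in pandas:

```python
# Illustrative sketch of imputation, filtering, normalization, and aggregation.
import pandas as pd

df = pd.DataFrame({
    "site": ["A", "A", "B", "B", "B"],
    "index_value": [1.2, None, 3.4, 2.8, 50.0],  # one missing value, one outlier
})

# Imputation: fill the missing value with the column median
df["index_value"] = df["index_value"].fillna(df["index_value"].median())

# Filtering: drop rows far outside the bulk of the data
df = df[df["index_value"] < 10]

# Normalization: rescale to zero mean and unit variance (z-scores)
df["index_z"] = (df["index_value"] - df["index_value"].mean()) / df["index_value"].std()

# Aggregation: summarize per site
print(df.groupby("site")["index_z"].mean())
```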

What are some potential risks of data transformations?

One potential risk of data transformations is the loss of important information or the introduction of errors. It is important to carefully plan and document any transformations to ensure the integrity and accuracy of the data. Additionally, over-transforming data can lead to misleading or incorrect results.
