Data transformations: How do you know when to stop?

In summary, the results of the experiment were not statistically significant, but the PI wants to keep transforming the data to see if that changes anything. However, the more the data is transformed, the further it strays from what the raw data truly represents.
  • #1
ProfuselyQuarky
I'm analyzing raw data and, although the trends look promising visually, none of it is statistically significant. I was going to leave it at that, since the data came from only one year of the experiment, and simply note that if treatment continued for a longer period there might be more significant results compared to control. But my PI told me to transform the data to see if that changes anything. I've never really done that before, so I'm working on it, but I want to know what the general logic is for deciding when to keep transforming your data versus just accepting the non-significance and moving on. I.e., I'm slowly forcing the data to be more normal because the variances are unequal, but the more I do so, the more it strays from what the raw data truly represents, no? I'm not a programmer or data analyst; this is all foreign to me.
 
  • #2
Normally, transforming data means fixing up units of measure for consistency, dropping clear outliers, and dropping data from botched experiments, if any.

What you shouldn't do is drop data just because it disagrees with what you're trying to prove, a.k.a. cherry-picking the data.

Some researchers, knowing that journals won't publish negative results, skip reporting those experiments and publish only their positive results. This frustrates other researchers who try to replicate what the first group did, have a poor success rate, and then discover that the first group had the same problem but didn't include it in their paper.

What is it that your PI wants you to do exactly?
 
  • #3
jedishrfu said:
What is it that your PI wants you to do exactly?
Basically, two of the four indices I tested were extremely left-skewed relative to a normal distribution. Specifically, they want me to force the data into a proper distribution so that the variance assumption behind ANOVA isn't totally violated and we can allegedly get better p-values. There are certainly a few outliers, since the sample size is pretty large, but I just wasn't sure what the threshold was for accepting that we weren't going to get the desired numbers (or, maybe not the best numbers, but the "better" numbers).

And yeah, I'm always afraid of being dishonest with results, so I make a conscious effort not to "cherry-pick". It might be worth noting that I was sort of coerced into this project with this different PI, and there's been a lot of pressure to produce publish-worthy results before I leave the university at the end of August. I don't think my data is bad, though. It's just not as magical as they might have wanted.
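For context, the kind of thing I'm attempting looks roughly like the sketch below: a power transform to reduce the skew, then re-checking the equal-variance assumption before running ANOVA. The arrays and group names are made up for illustration, not my actual indices.

```python
# Minimal sketch: variance-stabilizing transform + assumption checks with SciPy.
# The data below is synthetic; substitute the real control/treatment indices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.lognormal(mean=0.0, sigma=0.8, size=50)    # skewed stand-in data
treatment = rng.lognormal(mean=0.2, sigma=1.2, size=50)

# Equal-variance check on the raw data
print("Levene (raw):", stats.levene(control, treatment))

# Box-Cox needs strictly positive values; it estimates a power transform
control_t, lam = stats.boxcox(control)
treatment_t = stats.boxcox(treatment, lmbda=lam)  # reuse lambda so groups stay comparable

# Re-check the assumption and run the ANOVA on the transformed values
print("Levene (transformed):", stats.levene(control_t, treatment_t))
print("ANOVA (transformed):", stats.f_oneway(control_t, treatment_t))
```

If the variance check still fails after the transform, that's probably a sign the transformation isn't the right fix rather than a reason to keep piling on more of them.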
 
  • #4
My niece was in a similar fix: her PI was being promoted to department head, so her publication was delayed until that happened. Her work basically obliterated three earlier papers by showing that the effect they were looking for just didn't exist. She was so mad about it, as she wanted to contribute in a positive way, not in a way where prior papers had to be retracted because of her work.

Her experiment used two genetically altered mice carrying the gene of interest. The first hurdle was that one mouse turned out not to have the altered gene, so the data got skewed. The second hurdle was that the expected effect wasn't seen, hence her negative results. But hey, the PI got his promotion and the earlier papers were retracted, but not to his detriment.
 
  • #5
jedishrfu said:
But hey, the PI got his promotion and the earlier papers were retracted, but not to his detriment.
Yeah, this has been an issue before too... Basically, a Michaelis constant was obtained and published, but the authors' conclusions relied on conveniently equating Km to their half-velocity constant. Since that only works when the rate of change of the enzyme-substrate complex is very, very small (which it wasn't), all the numbers were wrong and everyone was so mad. The journal refused to retract. It really makes me question peer review.

I digress, however. I appreciate the advice. I'm going to email him and tell him that nothing I do in Python is going to turn the data into what he wants it to be. Thank you!
 
  • #6
Have you assessed whether a non-parametric test that doesn't make the assumption of normality is a better choice?
 
  • #7
Jarvis323 said:
Have you assessed whether a non-parametric test that doesn't make the assumption of normality is a better choice?
I did do a principal component analysis (whilst excluding outliers), which to my understanding doesn't have any Gaussian distribution assumption... this also indicated a promising trend, but nothing shocking.
 
  • #8
ProfuselyQuarky said:
I did do a principal component analysis (whilst excluding outliers), which to my understanding doesn't have any Gaussian distribution assumption... this also indicated a promising trend, but nothing shocking.
I mean a non-parametric null hypothesis test.
https://www.nature.com/articles/nmeth.2937
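For instance, a minimal sketch of what that could look like in SciPy, using a rank-based two-sample test and its multi-group counterpart (the arrays here are placeholders, not the thread's data):

```python
# Hedged sketch: non-parametric alternatives to the t-test/ANOVA in SciPy.
# These rank-based tests make no normality assumption about the data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.exponential(scale=1.0, size=60)    # skewed, synthetic stand-ins
treatment = rng.exponential(scale=1.4, size=60)

# Two groups: Mann-Whitney U test (Wilcoxon rank-sum)
print(stats.mannwhitneyu(control, treatment, alternative="two-sided"))

# Three or more groups: Kruskal-Wallis H-test
third_group = rng.exponential(scale=1.2, size=60)
print(stats.kruskal(control, treatment, third_group))
```

Which test is appropriate depends on the design (paired vs. independent samples, number of groups), so this only shows the general shape of the call.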
 
  • #9
Jarvis323 said:
I mean a non-parametric null hypothesis test.
https://www.nature.com/articles/nmeth.2937
Oh, thanks for this. I'll download the paper and look through it for sure. Again, I'm in no way a pro data analyst and am still trying to learn.
 
  • #10
Applying a normalization function to your data doesn't make your result less significant. The numerical values attached to your data don't have any special status over any other way you could have chosen to write the numbers down. As long as you are doing a computation that can be applied to future data, there's no inherent problem.

That said, the more times you go through the cycle of manipulate the data, run a test, manipulate the data again, repeat, the less meaningful your final result is. You will eventually luck into a good p-value, and that doesn't mean anything.
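A toy simulation of that last point (entirely synthetic data, unrelated to the thread's experiment): if you try several transformations of the same null data and keep whichever gives the smallest p-value, the false-positive rate climbs above the nominal 5%.

```python
# Toy demonstration: cherry-picking the transformation inflates false positives.
# Both groups are drawn from the SAME distribution, so any "significance" is spurious.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
transforms = [lambda x: x, np.log, np.sqrt, lambda x: 1.0 / x, lambda x: x ** 2]

trials, hits = 2000, 0
for _ in range(trials):
    a = rng.lognormal(size=30)
    b = rng.lognormal(size=30)
    # Keep the best p-value across all candidate transforms
    best_p = min(stats.ttest_ind(f(a), f(b)).pvalue for f in transforms)
    if best_p < 0.05:
        hits += 1

print(f"False-positive rate when picking the best transform: {hits / trials:.3f}")
```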
 
  • #11
While you want to avoid false positive results, you should also try to avoid false negative results; both can have a detrimental impact on scientific progress. So you should also avoid reporting that the results are not statistically significant if some kind of transformation or alternative method is actually needed to detect the effect, depending on the characteristics of the data. However, you shouldn't just blindly try everything until you get a positive result, as Office_Shredder pointed out. You should analyze the data, learn the necessary principles of data science and statistics and the approaches of your peers, and get feedback from others, in order to settle on a processing pipeline and methodology that is justified by your data's characteristics, and in the paper you should explain that justification clearly. If the justified approach doesn't yield positive results, then you stop there, I think.
 

What is the purpose of data transformations?

Data transformations are used to manipulate and organize data in a way that makes it more useful for analysis or other purposes. This can include converting data types, cleaning up messy data, or rearranging data for easier interpretation.

When should data transformations be performed?

Data transformations should be performed after data collection and before any analysis or modeling takes place. This ensures that the data is in a usable format and can lead to more accurate and meaningful results.

How do you know when to stop performing data transformations?

The decision to stop performing data transformations is largely based on the quality and usability of the data. Generally, you should stop when the data is in a format that is suitable for the specific analysis or purpose at hand.

What are some common techniques used for data transformations?

Some common techniques for data transformations include data normalization, aggregation, filtering, and imputation. These techniques can help to standardize and clean up data, making it easier to work with and analyze.
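As a rough illustration (the DataFrame and column names below are invented for the example), these techniques might look like this in pandas:

```python
# Illustrative sketch of imputation, filtering, normalization, and aggregation.
import pandas as pd

df = pd.DataFrame({
    "site": ["A", "A", "B", "B", "B"],
    "index_value": [1.2, None, 3.4, 2.8, 50.0],  # one missing value, one outlier
})

# Imputation: fill the missing value with the column median
df["index_value"] = df["index_value"].fillna(df["index_value"].median())

# Filtering: drop rows far outside the bulk of the data
df = df[df["index_value"] < 10]

# Normalization: rescale to zero mean and unit variance (z-scores)
df["index_z"] = (df["index_value"] - df["index_value"].mean()) / df["index_value"].std()

# Aggregation: summarize per site
print(df.groupby("site")["index_z"].mean())
```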

What are some potential risks of data transformations?

One potential risk of data transformations is the loss of important information or the introduction of errors. It is important to carefully plan and document any transformations to ensure the integrity and accuracy of the data. Additionally, over-transforming data can lead to misleading or incorrect results.
