Is this statistically significant?

  • I
  • Thread starter beamthegreat
  • Start date
  • Tags
    bitcoin
In summary: I've already tried incrementally increasing/decreasing both lower and upper threshold by 0.5 standard deviations starting from the mean, calculated the R-squared value, and plotted the results as seen in posts #6 and #8. This is finance, so you focus on the beta t-stat, not the R2. All the R2 says is that the variable possibly explains a small percentage of the variation, and for a trading strategy this is fine. Although 1.3% is too low, no one gets 90% R2 for any nontrivial result in finance; 20% would be significant. Don't massage the data, just look to see if it's a significant predictor.
  • #1
beamthegreat
TL;DR Summary
Can statistical significance be established if the low Y outliers tend to be associated with low X values and the high Y outliers tend to be associated with high X values?
Screen Shot 2021-01-05 at 22.44.43.png


X-axis = past 1-week percent change of the search term "best bitcoin wallet"

Y-axis = future 1-week percent change of the price of bitcoin

From the data, if the number of searches increases by more than 150%, it seems the price of bitcoin almost always increases, however, there are only 10 samples out of 260 of this happening. Likewise, if the number of searches decreases by more than 60%, bitcoin almost always drops.

Could this be statistically significant? Any comments on this?
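
One rough way to put a number on the "almost always increases" observation is an exact one-sided binomial (sign) test on the extreme weeks alone. The sketch below is illustrative only: the count of 9 rises out of 10 is a hypothetical stand-in (the post gives the 10-of-260 sample size but not the exact number of rises), and the 50% null rate is an assumption.

```python
from math import comb


def binomial_tail_p(successes: int, trials: int, p_null: float = 0.5) -> float:
    """One-sided exact p-value: P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical counts: suppose 9 of the 10 extreme-search weeks
# were followed by a price increase.
p_value = binomial_tail_p(9, 10)
print(f"one-sided p = {p_value:.4f}")
```

Two caveats weaken any p-value obtained this way. The 150% cutoff was chosen after looking at the plot, and if bitcoin rose in more than half of all weeks in the sample, the share of up weeks in the full data set would be a fairer `p_null` than 0.5.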
 
  • #3
FactChecker said:
No. The X-axis parameter explains very little of the Y-variable parameter variation. An ##R^2## value of 0.013 indicates that there is very little statistical significance in the linear regression. (See https://statisticsbyjim.com/regression/interpret-r-squared-regression/ )

I understand that the data as a whole has little to no correlation, but my hypothesis is that during extreme events, X might be predictive of Y. Is there any way to prove/disprove this hypothesis?
 
  • #4
I've plotted the distribution of X and found that it follows a log-normal distribution. My idea is to exclude values n standard deviations away from the mean and plot the graph of n vs the R-squared value. If the R-squared value increases as I increase n, will this indicate that my hypothesis is correct?
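
The threshold scan described here can be tried on data where X and Y are unrelated by construction, to see how R-squared behaves when only a handful of extreme points remain. This is a minimal sketch, assuming the proposal means keeping only the points at least n standard deviations from the mean of X; the synthetic Gaussian data, the function names, and the `min_points` cutoff are all illustrative assumptions, not the poster's actual procedure.

```python
import random
from statistics import mean, pstdev


def r_squared(xs, ys):
    """Squared Pearson correlation, which equals R^2 for a one-variable linear fit."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy**2 / (sxx * syy)


def r2_by_threshold(xs, ys, thresholds, min_points=5):
    """For each n, keep points with |x - mean| >= n * sd and report (n, count, R^2)."""
    mx, sx = mean(xs), pstdev(xs)
    rows = []
    for n in thresholds:
        kept = [(x, y) for x, y in zip(xs, ys) if abs(x - mx) >= n * sx]
        if len(kept) < min_points:
            break  # too few points left for the fit to mean anything
        kx, ky = zip(*kept)
        rows.append((n, len(kept), r_squared(kx, ky)))
    return rows


# Synthetic example: ys is generated independently of xs.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(260)]
ys = [rng.gauss(0, 1) for _ in range(260)]
rows = r2_by_threshold(xs, ys, [0, 0.5, 1.0, 1.5, 2.0])
for n, count, r2 in rows:
    print(f"n = {n:>3}: kept {count:>3} points, R^2 = {r2:.3f}")
```

Rerunning this with different seeds gives a feel for how much R-squared moves around by chance once few points are left, which is the benchmark a rising R-squared on the real data would need to beat.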
 
  • #5
beamthegreat said:
Is this data statistically significant?

A good statistical rule of thumb: if you have to ask, the answer is "no".

My personal view is that if you a posteriori throw away 96% of your data, you don't have much there there.
 
  • #6
I went ahead and did it. Got a pretty decent correlation and an R-squared of 0.701.
 

Attachments

  • Screen Shot 2021-01-06 at 01.10.53.png
  • #7
Vanadium 50 said:
A good statistical rule of thumb: if you have to ask, the answer is "no".

My personal view is that if you a posteriori throw away 96% of your data, you don't have much there there.

I've used all the data with my method but not sure if that's indicative of anything. Are there any other ways to prove/disprove this hypothesis?
 
  • #8


Here's what it looks like.
 
  • #9
beamthegreat said:
I've used all the data with my method but not sure if that's indicative of anything. Are there any other ways to prove/disprove this hypothesis?
So your question is: if the low Y outliers tend to be associated with low X values and the high Y outliers tend to be associated with high X values, can statistical significance be established? Good question. It doesn't seem unreasonable to me. But I do not see that tendency in the data in your original post.
 
  • #10
FactChecker said:
So your question is: if the low Y outliers tend to be associated with low X values and the high Y outliers tend to be associated with high X values, can statistical significance be established? Good question. It doesn't seem unreasonable to me.

Exactly my question.

FactChecker said:
But I do not see that tendency in the data in your original post.

Is there a way to mathematically prove this? Here is what I see (but we all see patterns even from randomness):

Screen Shot 2021-01-05 at 22.44.43.png
 
  • #11
What happens if you change your lower threshold to 50%, or your upper threshold to 100%, or both? What happens if you take the first half of your data, choose thresholds based on that, and then apply those thresholds in the last half?
 
  • #12
Dale said:
What happens if you change your lower threshold to 50%, or your upper threshold to 100%, or both? What happens if you take the first half of your data, choose thresholds based on that, and then apply those thresholds in the last half?

I've already tried incrementally increasing/decreasing both lower and upper threshold by 0.5 standard deviations starting from the mean, calculated the R-squared value, and plotted the results as seen in post #6 and #8.
 
  • #13
This is finance, so you focus on the beta t-stat, not the R2. All the R2 says is that the variable possibly explains a small percentage of the variation, and for a trading strategy this is fine. Although 1.3% is too low, no one gets 90% R2 for any nontrivial result in finance; 20% would be significant. Don't massage the data, just look to see if it's a significant predictor. Logically, there is no way that internet searches for this term would account for most of the variation in BTC prices; there are too many other potential factors.
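
The link between R2 and the slope t-stat in a one-variable regression can be made concrete. For simple OLS with an intercept, |t| = sqrt(R2 (n - 2) / (1 - R2)). The sketch below plugs in the R2 of 0.013 quoted in post #3 and n = 260 from post #1; treating the 260 weekly points as independent observations in a single-regressor fit is an assumption.

```python
from math import sqrt


def slope_t_from_r2(r2: float, n: int) -> float:
    """|t| of the slope in a one-regressor OLS fit, given R^2 and sample size n."""
    return sqrt(r2 * (n - 2) / (1.0 - r2))


t_abs = slope_t_from_r2(0.013, 260)
print(f"|t| = {t_abs:.2f} on {260 - 2} degrees of freedom")
```

This works out to roughly 1.84, a little under the two-sided 5% critical value of about 1.97 for 258 degrees of freedom.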

Of course the data will look lognormal, as it consists of percentage changes and cannot by definition be lower than -100%.

Your interest, logically, is in significant changes in internet search activity, and you would expect BTC price changes to be random around no change in search activity. A logistic regression that only triggers with a significant change in search might help, or simply using a probit where x=1 if abs(search activity) > X%, zero otherwise. Be careful: you lose a lot of data the more you cut off x.
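
The dummy-variable idea can be tried without a full logistic or probit fit. The sketch below swaps in a simpler technique: it flags weeks with abs(search change) above a threshold and compares mean returns between flagged and unflagged weeks using a Welch t-statistic. The function name and the toy numbers are hypothetical, and this is a stand-in for the suggested models, not an implementation of them.

```python
from math import sqrt
from statistics import mean, variance


def dummy_split_t(search_changes, returns, threshold):
    """Mean return in flagged weeks minus unflagged weeks, with a Welch t-statistic.

    A week is flagged when abs(search change) > threshold.
    Returns (difference in means, t-statistic, number of flagged weeks).
    """
    flagged = [r for s, r in zip(search_changes, returns) if abs(s) > threshold]
    quiet = [r for s, r in zip(search_changes, returns) if abs(s) <= threshold]
    if len(flagged) < 2 or len(quiet) < 2:
        raise ValueError("need at least two observations on each side of the threshold")
    diff = mean(flagged) - mean(quiet)
    se = sqrt(variance(flagged) / len(flagged) + variance(quiet) / len(quiet))
    return diff, diff / se, len(flagged)


# Hypothetical toy data (fractions, not percent): six weeks of search changes and returns.
search = [0.1, 0.2, 2.0, -1.5, 0.05, 1.8]
rets = [0.0, 0.02, 0.10, 0.08, -0.02, 0.12]
diff, t_stat, n_flagged = dummy_split_t(search, rets, threshold=1.0)
print(f"difference = {diff:.3f}, t = {t_stat:.2f}, flagged weeks = {n_flagged}")
```

Because the flag uses the absolute value, large rises and large falls in search are pooled together. Testing whether falls in search precede price drops would need a signed flag instead.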
 
  • #14
beamthegreat said:
I've already tried incrementally increasing/decreasing both lower and upper threshold by 0.5 standard deviations starting from the mean, calculated the R-squared value, and plotted the results as seen in post #6 and #8.
I see that now. It isn't very convincing in my opinion. You have to toss a lot of data before you get a decent regression.

A better approach would be to take the first half of the data, choose a threshold based on that, then see how it performs on the last half
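
The split-sample check suggested here can be sketched as follows. The selection rule (pick the candidate threshold with the highest share of positive next-week returns), the `min_count` floor, and all names are illustrative assumptions; the data are assumed to be in time order so that the first half really precedes the second.

```python
import random


def hit_rate(xs, ys, threshold):
    """Share of points with x > threshold whose y is positive, plus how many qualified."""
    hits = [y > 0 for x, y in zip(xs, ys) if x > threshold]
    rate = sum(hits) / len(hits) if hits else float("nan")
    return rate, len(hits)


def pick_threshold(xs, ys, candidates, min_count=5):
    """Candidate with the best hit rate among those with at least min_count points."""
    best, best_rate = None, -1.0
    for c in candidates:
        rate, count = hit_rate(xs, ys, c)
        if count >= min_count and rate > best_rate:
            best, best_rate = c, rate
    return best


def split_sample_check(xs, ys, candidates, min_count=5):
    """Choose the threshold on the first half, then score it on the second half."""
    half = len(xs) // 2
    thr = pick_threshold(xs[:half], ys[:half], candidates, min_count)
    if thr is None:
        return None
    return {
        "threshold": thr,
        "in_sample": hit_rate(xs[:half], ys[:half], thr),
        "out_of_sample": hit_rate(xs[half:], ys[half:], thr),
    }


# Synthetic example with no real relationship between xs and ys.
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(260)]
ys = [rng.gauss(0, 1) for _ in range(260)]
print(split_sample_check(xs, ys, candidates=[0.0, 0.5, 1.0, 1.5]))
```

The in-sample hit rate is biased upward because the threshold was tuned on that half. Only the out-of-sample figure says anything about predictive value.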
 
  • #15
Also, reversing the variables would be interesting: does last week's BTC performance correlate with internet search activity this week?

Adding an autocorrelation variable (i.e. last week's BTC return) might be interesting as well
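
Both suggestions come down to lagged correlations between two weekly series. A minimal sketch, assuming both series are aligned by week and equally long; the short lists below are hypothetical placeholders, not the thread's data.

```python
from math import sqrt
from statistics import mean


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    if saa == 0 or sbb == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / sqrt(saa * sbb)


def lagged_corr(leader, follower, lag=1):
    """Correlation between leader[t] and follower[t + lag]."""
    if len(leader) != len(follower):
        raise ValueError("series must have the same length")
    if lag < 1 or lag >= len(leader):
        raise ValueError("lag must be between 1 and len(series) - 1")
    return pearson(leader[:-lag], follower[lag:])


# Hypothetical weekly percent changes (as fractions).
search = [0.10, -0.20, 0.35, 0.05, -0.15, 0.40, -0.05, 0.20]
price = [0.02, 0.01, -0.03, 0.06, 0.00, -0.02, 0.08, -0.01]

print("search leads price:", lagged_corr(search, price))
print("price leads search:", lagged_corr(price, search))
print("price autocorrelation:", lagged_corr(price, price))
```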
 
  • #16
BWV said:
Your interest, logically, is in significant changes in internet search activity, and you would expect BTC price changes to be random around no change in search activity. A logistic regression that only triggers with a significant change in search might help, or simply using a probit where x=1 if abs(search activity) > X%, zero otherwise. Be careful: you lose a lot of data the more you cut off x.

BWV said:
Also, reversing the variables would be interesting: does last week's BTC performance correlate with internet search activity this week?

Thanks, both comments are very insightful. It is 4AM here, I will do both of your suggestions tomorrow.
 
  • #18
One problem is that you should evaluate the data at the ends of the range while keeping in mind the random variation in the middle unless you have a legitimate reason to think that the variation changes. In that context, the difference between the lower and upper X-range data is small compared to the random variation of the whole set of data. So there is no statistical significance.
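
One way to judge the gap between the two ends of the X range against the variation of the whole data set is a permutation test: compare the observed gap with the gaps obtained after shuffling Y. This is a sketch of that substitute technique under stated assumptions; the tail size `k`, the function names, and the synthetic data are hypothetical.

```python
import random
from statistics import mean


def tail_gap(xs, ys, k):
    """Mean y over the k largest-x points minus mean y over the k smallest-x points."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    low = [ys[i] for i in order[:k]]
    high = [ys[i] for i in order[-k:]]
    return mean(high) - mean(low)


def permutation_p(xs, ys, k, n_perm=2000, seed=0):
    """One-sided p-value: how often shuffled data gives a gap at least as large."""
    rng = random.Random(seed)
    observed = tail_gap(xs, ys, k)
    shuffled = list(ys)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if tail_gap(xs, shuffled, k) >= observed:
            at_least += 1
    # Add-one correction keeps the estimate away from an impossible p of exactly 0.
    return observed, (at_least + 1) / (n_perm + 1)


# Synthetic example with no real relationship between xs and ys.
rng = random.Random(2)
xs = [rng.gauss(0, 1) for _ in range(260)]
ys = [rng.gauss(0, 1) for _ in range(260)]
gap, p = permutation_p(xs, ys, k=10, n_perm=500)
print(f"gap = {gap:.3f}, permutation p = {p:.3f}")
```

The test is only valid if `k` is fixed before looking at the data, and shuffling assumes the weekly returns are exchangeable, which serial dependence in prices could violate.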
 
  • #19
BWV said:
A logistic regression that only triggers with a significant change in search might help, or simply using a probit where x=1 if abs(search activity) > X%, zero otherwise. Be careful: you lose a lot of data the more you cut off x.

Screen Shot 2021-01-06 at 19.01.45.png


Got a pretty decent linear correlation using your suggestion. Any thoughts on this?

Screen Shot 2021-01-06 at 19.09.56.png
 
  • #20
BWV said:
Also, reversing the variables would be interesting: does last week's BTC performance correlate with internet search activity this week?

Adding an autocorrelation variable (i.e. last week's BTC return) might be interesting as well

Screen Shot 2021-01-06 at 19.05.11.png


Seems to have zero correlation to me. X-axis = 1W bitcoin price change, Y-axis = future 1W search term change
 
  • #21
Although thinking about this, there is no reason to suppose the BTC % reaction to searches is symmetric or proportional. I think a logistic or probit on increases in searches vs. price change would be a realistic trading signal to test. Just looking at the data, there appears to be a high frequency of positive returns when there is an increase in search activity greater than around 10%. The standard procedure, as others have pointed out, would be to then use this data as a training set and test on a future set of data. The other question is whether this actually generated higher returns than just pure long BTC, as the time period contains an astronomical rate of return.

If you look at past bubbles, what would a regression of, say, internet searches for house flipping have told you about changes in house prices over the boom years of 2005-2006 vs. the bust years of 2007-2009? Or Internet stocks before or after they imploded in mid 2000?
 
  • #22
BWV said:
The standard procedure, as others have pointed out, would be to then use this data as a training set and test on a future set of data.

Or failing that, just go for cross-validation on the present dataset?
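
A cross-validation pass on the existing data could look like the sketch below. It uses contiguous blocks rather than shuffled folds, on the assumption that the rows are in time order and neighbouring weeks should stay together; the simple linear fit, the function names, and the synthetic data are illustrative choices, not something specified in the thread.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return slope, my - slope * mx


def blocked_cv_r2(xs, ys, k=5):
    """Out-of-sample R^2 from k contiguous folds.

    Each fold is predicted by a line fitted on the other folds and compared
    with a baseline that predicts the training-set mean of y.
    """
    xs, ys = list(xs), list(ys)
    n = len(xs)
    bounds = [round(i * n / k) for i in range(k + 1)]
    sse = sst = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        train_x, train_y = xs[:lo] + xs[hi:], ys[:lo] + ys[hi:]
        slope, intercept = fit_line(train_x, train_y)
        baseline = mean(train_y)
        for x, y in zip(xs[lo:hi], ys[lo:hi]):
            sse += (y - (intercept + slope * x)) ** 2
            sst += (y - baseline) ** 2
    return 1.0 - sse / sst


# Synthetic example with no real relationship between xs and ys.
rng = random.Random(3)
xs = [rng.gauss(0, 1) for _ in range(260)]
ys = [rng.gauss(0, 1) for _ in range(260)]
print(f"out-of-sample R^2 = {blocked_cv_r2(xs, ys):.3f}")
```

An out-of-sample R2 at or below zero means the fitted line predicts held-out weeks no better than the plain average return.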
 
  • #23
BWV said:
If you look at past bubbles, what would a regression of, say, internet searches for house flipping have told you about changes in house prices over the boom years of 2005-2006 vs. the bust years of 2007-2009? Or Internet stocks before or after they imploded in mid 2000?

No idea, but I will investigate this.

BWV said:
Although thinking about this, there is no reason to suppose the BTC % reaction to searches is symmetric or proportional. I think a logistic or probit on increases in searches vs. price change would be a realistic trading signal to test. Just looking at the data, there appears to be a high frequency of positive returns when there is an increase in search activity greater than around 10%. The standard procedure, as others have pointed out, would be to then use this data as a training set and test on a future set of data. The other question is whether this actually generated higher returns than just pure long BTC, as the time period contains an astronomical rate of return.

My initial idea was not to find a way to outperform a portfolio that is long bitcoin, but rather to confirm whether changes in search are predictive of future prices. If correlation can be proven, then we can discuss all sorts of statistical arbitrages and trading strategies.
 

1. What does "statistically significant" mean?

A result is called statistically significant when it would be unlikely to arise by chance alone if there were no real effect (the null hypothesis). It is a statement about the evidence against the null hypothesis, not a measure of how strong the relationship between two variables is or how large the difference between two groups is.

2. How is statistical significance determined?

Statistical significance is typically determined by calculating a p-value, which is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. By convention, a p-value below a threshold chosen in advance (commonly 0.05) is considered statistically significant.

3. What is the difference between statistical significance and practical significance?

Statistical significance concerns whether a result is unlikely to be explained by chance alone, while practical significance refers to the real-world importance or relevance of the results. A study may be statistically significant, but if the effect size is small, it may not have much practical significance.

4. Can a study be statistically significant but not meaningful?

Yes, a study can be statistically significant but not meaningful. This can happen when the sample size is very large, making even small differences statistically significant. However, these small differences may not have any practical significance in the real world.

5. Why is it important to determine if a result is statistically significant?

Determining if a result is statistically significant is important because it helps guard against mistaking chance variation for a real effect. This allows researchers to draw more reliable conclusions and make informed decisions based on the results of a study or experiment.
