Is this statistically significant?

  • Thread starter beamthegreat
  • Tags bitcoin
TL;DR Summary
Can statistical significance be established if the low Y outliers tend to be associated with low X values and the high Y outliers tend to be associated with high X values?
Screen Shot 2021-01-05 at 22.44.43.png


X-axis = past 1-week percent change of the search term "best bitcoin wallet"

Y-axis = future 1-week percent change of the price of bitcoin

From the data, if the number of searches increases by more than 150%, the price of bitcoin seems to almost always increase. However, this happens in only 10 samples out of 260. Likewise, if the number of searches decreases by more than 60%, bitcoin almost always drops.

Could this be statistically significant? Any comments on this?
 
FactChecker said:
No. The X-axis variable explains very little of the variation in the Y variable. An ##R^2## value of 0.013 means the linear regression accounts for only about 1.3% of that variation, so there is very little evidence of a linear relationship. (See https://statisticsbyjim.com/regression/interpret-r-squared-regression/ )

I understand that the data as a whole has little to no correlation, but my hypothesis is that during extreme events, X might be predictive of Y. Is there any way to prove/disprove this hypothesis?
 
I've plotted the distribution of X and found that it follows a log-normal distribution. My idea is to exclude values within n standard deviations of the mean, refit, and plot n against the resulting R-squared value. If the R-squared value increases as I increase n, will this indicate that my hypothesis is correct?
 
beamthegreat said:
Is this data statistically significant?

A good statistical rule of thumb: if you have to ask, the answer is "no".

My personal view is that if you a posteriori throw away 96% of your data, you don't have much there there.
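To make the a posteriori concern concrete, here is a minimal sketch that runs the same trim-and-refit procedure on pure noise rather than on the thread's data. The Gaussian noise, the 0.5-SD cutoff grid, the `min_points` floor, and the helper names are all assumptions chosen for illustration.

```python
import random

def r_squared(xs, ys):
    """R-squared of a simple linear regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

def best_trimmed_r2(xs, ys, cutoffs, min_points=5):
    """Drop points within k SDs of the mean of xs, refit, keep the best R-squared."""
    n = len(xs)
    mx = sum(xs) / n
    sd = (sum((x - mx) ** 2 for x in xs) / (n - 1)) ** 0.5
    best = 0.0
    for k in cutoffs:
        kept = [(x, y) for x, y in zip(xs, ys) if abs(x - mx) > k * sd]
        if len(kept) >= min_points:  # skip cutoffs that leave too few points
            kx, ky = zip(*kept)
            best = max(best, r_squared(kx, ky))
    return best

rng = random.Random(0)                    # fixed seed so the run is repeatable
cutoffs = [0.5 * i for i in range(7)]     # 0, 0.5, ..., 3.0 SDs (assumed grid)
n_obs, n_trials = 260, 500                # 260 matches the sample size in the first post

full_r2, best_r2 = [], []
for _ in range(n_trials):
    xs = [rng.gauss(0, 1) for _ in range(n_obs)]
    ys = [rng.gauss(0, 1) for _ in range(n_obs)]  # unrelated to xs by construction
    full_r2.append(r_squared(xs, ys))
    best_r2.append(best_trimmed_r2(xs, ys, cutoffs))

print("mean R-squared, all data:       ", sum(full_r2) / n_trials)
print("mean best R-squared after trims:", sum(best_r2) / n_trials)
```

Because the best R-squared over several cutoffs is picked after the fact, it can only match or exceed the full-sample value, even when X and Y are unrelated by construction.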
 
I went ahead and did it. Got a pretty decent correlation and an R-squared of 0.701.
 

Attachment: Screen Shot 2021-01-06 at 01.10.53.png
Vanadium 50 said:
A good statistical rule of thumb: if you have to ask, the answer is "no".

My personal view is that if you a posteriori throw away 96% of your data, you don't have much there there.

I've used all the data with my method, but I'm not sure whether that's indicative of anything. Are there any other ways to prove/disprove this hypothesis?
 


Here's what it looks like:
 
beamthegreat said:
I've used all the data with my method, but I'm not sure whether that's indicative of anything. Are there any other ways to prove/disprove this hypothesis?
So your question is: if the low Y outliers tend to be associated with low X values and the high Y outliers tend to be associated with high X values, can statistical significance be established? Good question. It doesn't seem unreasonable to me. But I do not see that tendency in the data in your original post.
 
  • #10
FactChecker said:
So your question is: if the low Y outliers tend to be associated with low X values and the high Y outliers tend to be associated with high X values, can statistical significance be established? Good question. It doesn't seem unreasonable to me.

Exactly my question.

FactChecker said:
But I do not see that tendency in the data in your original post.

Is there a way to mathematically prove this? Here is what I see (but we all see patterns even from randomness):

Screen Shot 2021-01-05 at 22.44.43.png
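One way to put a number on the pattern is a permutation test: compare the mean Y above the upper threshold with the mean Y below the lower threshold, then see how often shuffled Y values produce a gap at least that large. The sketch below is only illustrative. The data are synthetic (not the thread's), the helper names are made up, and the -60% / +150% thresholds are taken from the first post. Since those thresholds were picked after looking at the plot, a p-value obtained this way would be optimistic, and shuffling also assumes the weekly observations are exchangeable.

```python
import math
import random

def tail_gap(xs, ys, lo, hi):
    """Mean of y where x > hi minus mean of y where x < lo."""
    high = [y for x, y in zip(xs, ys) if x > hi]
    low = [y for x, y in zip(xs, ys) if x < lo]
    if not high or not low:
        raise ValueError("one of the tails is empty")
    return sum(high) / len(high) - sum(low) / len(low)

def permutation_p_value(xs, ys, lo, hi, n_perm=2000, seed=0):
    """One-sided p-value: share of shuffles with a tail gap >= the observed one."""
    rng = random.Random(seed)
    observed = tail_gap(xs, ys, lo, hi)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)               # breaks any link between x and y
        if tail_gap(xs, shuffled, lo, hi) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Synthetic stand-in data: log-normal-style percent changes for x, noise for y.
rng = random.Random(1)
xs = [100 * (math.exp(rng.gauss(0, 0.6)) - 1) for _ in range(260)]
ys_null = [rng.gauss(0, 10) for _ in xs]
# Same noise plus an artificial +/-15 point shift in the tails (an assumption).
ys_effect = [y + (15 if x > 150 else -15 if x < -60 else 0)
             for x, y in zip(xs, ys_null)]

gap_null, p_null = permutation_p_value(xs, ys_null, lo=-60, hi=150)
gap_effect, p_effect = permutation_p_value(xs, ys_effect, lo=-60, hi=150)
print("no built-in effect:", gap_null, p_null)
print("built-in effect:   ", gap_effect, p_effect)
```

Running the same function on the real X and Y columns would show where the observed gap falls relative to what shuffling alone produces.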
 
  • #11
What happens if you change your lower threshold to 50%, or your upper threshold to 100%, or both? What happens if you take the first half of your data, choose thresholds based on that, and then apply those thresholds in the last half?
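The second suggestion can be sketched as follows. This is a rough illustration under assumptions: the data are in chronological order, "choosing thresholds" means a grid search for the pair with the best directional hit rate, and all function names and candidate grids are hypothetical.

```python
import random

def signal(x, lo, hi):
    """+1 above the upper threshold, -1 below the lower one, 0 otherwise."""
    if x > hi:
        return 1
    if x < lo:
        return -1
    return 0

def hit_rate(xs, ys, lo, hi):
    """Share of triggered weeks where the sign of y matches the signal, plus the count."""
    fired = [(signal(x, lo, hi), y) for x, y in zip(xs, ys) if signal(x, lo, hi) != 0]
    if not fired:
        return None, 0
    hits = sum(1 for s, y in fired if s * y > 0)
    return hits / len(fired), len(fired)

def choose_thresholds(xs, ys, lows, highs, min_fired=5):
    """Grid search for the (lo, hi) pair with the best in-sample hit rate."""
    best = None
    for lo in lows:
        for hi in highs:
            rate, n = hit_rate(xs, ys, lo, hi)
            if n > 0 and n >= min_fired and (best is None or rate > best[0]):
                best = (rate, lo, hi)
    return best

def split_evaluate(xs, ys, lows, highs, min_fired=5):
    """Pick thresholds on the first half, then score them on the second half."""
    half = len(xs) // 2
    fit = choose_thresholds(xs[:half], ys[:half], lows, highs, min_fired)
    if fit is None:
        raise ValueError("no threshold pair fired often enough in the first half")
    train_rate, lo, hi = fit
    test_rate, n_test = hit_rate(xs[half:], ys[half:], lo, hi)
    return {"lo": lo, "hi": hi, "train_rate": train_rate,
            "test_rate": test_rate, "n_test": n_test}

# Demo on pure noise (synthetic, standardized units rather than percent).
rng = random.Random(2)
noise_x = [rng.gauss(0, 1) for _ in range(260)]
noise_y = [rng.gauss(0, 1) for _ in range(260)]
print(split_evaluate(noise_x, noise_y, lows=[-1.5, -1.0, -0.5], highs=[0.5, 1.0, 1.5]))
```

The in-sample rate is the best of several tries, so it tends to flatter the rule. The second-half rate is the number to look at, along with how many weeks actually triggered.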
 
  • #12
Dale said:
What happens if you change your lower threshold to 50%, or your upper threshold to 100%, or both? What happens if you take the first half of your data, choose thresholds based on that, and then apply those thresholds in the last half?

I've already tried incrementally moving both the lower and upper thresholds outward from the mean in steps of 0.5 standard deviations, calculating the R-squared value at each step, and plotting the results, as seen in posts #6 and #8.
 
  • #13
This is finance, so you should focus on the t-stat of the beta, not the ##R^2##. All the ##R^2## says is that the variable possibly explains a small percentage of the variation, and for a trading strategy that is fine. Although 1.3% is too low, no one gets a 90% ##R^2## for any nontrivial result in finance; 20% would be significant. Don't massage the data, just look to see whether it's a significant predictor. Logically, there is no way that internet searches for this term would account for most of the variation in BTC prices; there are too many other potential factors.
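For a regression with a single predictor, the slope's t-stat can be recovered from the R-squared and the sample size alone via t² = R²(n − 2)/(1 − R²). The sketch below plugs in the 0.013 quoted earlier and assumes the 260 samples from the first post are what the regression was fit on. The p-value uses a normal approximation, which should be close with this many degrees of freedom, and it treats the weekly observations as independent, which may not hold.

```python
import math
from statistics import NormalDist

def slope_t_from_r2(r2, n):
    """|t| of the slope in a one-predictor regression, from R-squared and n."""
    return math.sqrt(r2 * (n - 2) / (1.0 - r2))

def two_sided_p_normal(t):
    """Two-sided p-value using a normal approximation to the t distribution."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))

t_abs = slope_t_from_r2(0.013, 260)   # values quoted in the thread; n is assumed
p_approx = two_sided_p_normal(t_abs)
print(f"|t| = {t_abs:.2f}, approximate two-sided p = {p_approx:.3f}")
```

On these inputs the t-stat comes out a little below the conventional 1.96 cutoff, so the full-sample slope is borderline rather than clearly significant or clearly absent.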

Of course the data will look log-normal, since it consists of percentage changes and by definition cannot be lower than -100%.

What you are logically interested in is significant changes in internet search activity, and you would expect BTC price changes to be random when search activity is unchanged. A logistic regression that only triggers on a significant change in search might help, or simply a probit with a dummy variable where x = 1 if abs(search activity change) > X% and zero otherwise. Be careful: the more of x you cut off, the more data you lose.
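A rough sketch of the dummy-variable idea follows. It is a variation rather than exactly what is described above: it uses plain least squares instead of a logit or probit, and a signed dummy (+1 for a spike up, -1 for a spike down) so that the slope measures direction. The cutoff, the synthetic data, and the function names are all assumptions.

```python
import math
import random

def threshold_dummy(x, cutoff, signed=True):
    """1 if |x| > cutoff (carrying the sign of x when signed=True), else 0."""
    if abs(x) <= cutoff:
        return 0
    if signed:
        return 1 if x > 0 else -1
    return 1

def ols_slope_t(ds, ys):
    """Slope, standard error and t-stat for a least-squares fit of ys on ds."""
    n = len(ds)
    md, my = sum(ds) / n, sum(ys) / n
    sdd = sum((d - md) ** 2 for d in ds)
    if sdd == 0:
        raise ValueError("dummy never changes; pick a lower cutoff")
    slope = sum((d - md) * (y - my) for d, y in zip(ds, ys)) / sdd
    intercept = my - slope * md
    rss = sum((y - intercept - slope * d) ** 2 for d, y in zip(ds, ys))
    se = math.sqrt(rss / (n - 2) / sdd)
    return slope, se, slope / se

# Synthetic demo: search changes in percent, with a 5-point effect built in.
rng = random.Random(3)
search = [rng.gauss(0, 50) for _ in range(260)]
dummy = [threshold_dummy(x, cutoff=60) for x in search]
price = [5 * d + rng.gauss(0, 10) for d in dummy]
print(ols_slope_t(dummy, price))
```

A symmetric cutoff is crude for percent changes, which are bounded below by -100% but not above, so separate lower and upper cutoffs may suit the real data better.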
 
  • #14
beamthegreat said:
I've already tried incrementally moving both the lower and upper thresholds outward from the mean in steps of 0.5 standard deviations, calculating the R-squared value at each step, and plotting the results, as seen in posts #6 and #8.
I see that now. It isn't very convincing in my opinion. You have to toss a lot of data before you get a decent regression.

A better approach would be to take the first half of the data, choose a threshold based on that, and then see how it performs on the last half.
 
  • #15
Also, reversing the variables would be interesting: does last week's BTC performance correlate with internet search activity this week?

Adding an autocorrelation variable (i.e. last week's BTC return) might be interesting as well.
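Both checks come down to a lagged correlation, which could be sketched as below. The series are synthetic weekly changes in chronological order and the helper names are hypothetical. The sketch only computes correlations. It does not add last week's return as an extra regressor.

```python
import random

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    if saa == 0 or sbb == 0:
        raise ValueError("a series is constant")
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (saa * sbb) ** 0.5

def lagged_corr(lead, follow, lag=1):
    """Correlation between lead[t] and follow[t + lag]."""
    if len(lead) != len(follow):
        raise ValueError("series must have equal length")
    if lag < 0 or lag >= len(lead) - 1:
        raise ValueError("lag out of range")
    return pearson(lead[:len(lead) - lag], follow[lag:])

# Synthetic weekly percent changes (placeholders for the real series).
rng = random.Random(4)
search_chg = [rng.gauss(0, 50) for _ in range(260)]
btc_chg = [rng.gauss(0, 10) for _ in range(260)]

search_then_price = lagged_corr(search_chg, btc_chg, lag=1)  # original direction
price_then_search = lagged_corr(btc_chg, search_chg, lag=1)  # reversed direction
price_autocorr = lagged_corr(btc_chg, btc_chg, lag=1)        # lag-1 autocorrelation
print(search_then_price, price_then_search, price_autocorr)
```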
 
  • #16
BWV said:
What you are logically interested in is significant changes in internet search activity, and you would expect BTC price changes to be random when search activity is unchanged. A logistic regression that only triggers on a significant change in search might help, or simply a probit with a dummy variable where x = 1 if abs(search activity change) > X% and zero otherwise. Be careful: the more of x you cut off, the more data you lose.

BWV said:
Also, reversing the variables would be interesting: does last week's BTC performance correlate with internet search activity this week?

Thanks, both comments are very insightful. It is 4 AM here, so I will try both of your suggestions tomorrow.
 
  • #18
One problem is that you should evaluate the data at the ends of the range while keeping in mind the random variation in the middle unless you have a legitimate reason to think that the variation changes. In that context, the difference between the lower and upper X-range data is small compared to the random variation of the whole set of data. So there is no statistical significance.
 
  • #19
BWV said:
A logistic regression that only triggers on a significant change in search might help, or simply a probit with a dummy variable where x = 1 if abs(search activity change) > X% and zero otherwise. Be careful: the more of x you cut off, the more data you lose.

Screen Shot 2021-01-06 at 19.01.45.png


Got a pretty decent linear correlation using your suggestion. Any thoughts on this?

Screen Shot 2021-01-06 at 19.09.56.png
 
  • #20
BWV said:
Also, reversing the variables would be interesting: does last week's BTC performance correlate with internet search activity this week?

Adding an autocorrelation variable (i.e. last week's BTC return) might be interesting as well.

Screen Shot 2021-01-06 at 19.05.11.png


Seems to have zero correlation to me. X-axis = past 1-week percent change of the bitcoin price, Y-axis = future 1-week percent change of the search term.
 
  • #21
Thinking about this further: there is no reason to suppose the BTC % reaction to searches is symmetric or proportional, so I think a logistic or probit regression of price change on increases in searches would be a realistic trading signal to test. Just looking at the data, there appears to be a high frequency of positive returns when search activity increases by more than around 10%. The standard procedure, as others have pointed out, would be to treat this data as a training set and then test the resulting rule on a future set of data. The other question is whether this would actually have generated higher returns than simply holding BTC, since the time period contains an astronomical rate of return.

If you look at past bubbles, what would a regression on, say, internet searches for house flipping have told you about changes in house prices over the boom years of 2005-2006 versus the bust years of 2007-2009? Or about internet stocks before and after they imploded in mid-2000?
 
  • #22
BWV said:
The standard procedure, as others have pointed out, would be to treat this data as a training set and then test the resulting rule on a future set of data.

Or failing that, just go for cross-validation on the present dataset?
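A minimal sketch of what cross-validation on the existing data might look like, assuming a plain linear fit of Y on X and contiguous (time-ordered) folds. The fold count, the synthetic data, and the function names are assumptions. With weekly changes, neighbouring folds may still leak information into each other, so the result would be indicative at best.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x is constant in the training data")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope

def blocked_cv_r2(xs, ys, k=5):
    """Out-of-sample R-squared over k contiguous folds.

    The baseline for each fold is the mean of y in that fold's training data,
    so a value near or below zero means the line predicts no better than a
    constant.
    """
    n = len(xs)
    bounds = [i * n // k for i in range(k + 1)]
    sse = sst = 0.0
    for i in range(k):
        a, b = bounds[i], bounds[i + 1]
        train_x, train_y = xs[:a] + xs[b:], ys[:a] + ys[b:]
        intercept, slope = fit_line(train_x, train_y)
        base = sum(train_y) / len(train_y)
        for x, y in zip(xs[a:b], ys[a:b]):
            sse += (y - intercept - slope * x) ** 2   # error of the fitted line
            sst += (y - base) ** 2                    # error of the constant baseline
    return 1.0 - sse / sst

# Synthetic demo on unrelated noise (lists, in chronological order).
rng = random.Random(5)
cv_x = [rng.gauss(0, 1) for _ in range(260)]
cv_y = [rng.gauss(0, 1) for _ in range(260)]
noise_cv_r2 = blocked_cv_r2(cv_x, cv_y, k=5)
print("out-of-sample R-squared on noise:", noise_cv_r2)
```

Unlike the in-sample R-squared, this figure can be negative, which is a useful signal that a fitted relationship does not carry over to held-out weeks.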
 
  • #23
BWV said:
If you look at past bubbles, what would a regression on, say, internet searches for house flipping have told you about changes in house prices over the boom years of 2005-2006 versus the bust years of 2007-2009? Or about internet stocks before and after they imploded in mid-2000?

No idea, but I will investigate this.

BWV said:
Thinking about this further: there is no reason to suppose the BTC % reaction to searches is symmetric or proportional, so I think a logistic or probit regression of price change on increases in searches would be a realistic trading signal to test. Just looking at the data, there appears to be a high frequency of positive returns when search activity increases by more than around 10%. The standard procedure, as others have pointed out, would be to treat this data as a training set and then test the resulting rule on a future set of data. The other question is whether this would actually have generated higher returns than simply holding BTC, since the time period contains an astronomical rate of return.

My initial idea was not to find a way to outperform a portfolio that is long bitcoin, but rather to confirm whether changes in search are predictive of future prices. If correlation can be proven, then we can discuss all sorts of statistical arbitrages and trading strategies.
 
