Is this statistically significant?

beamthegreat · Jan 5, 2021

X-axis = past 1-week percent change of the search term "best bitcoin wallet"

Y-axis = future 1-week percent change of the price of bitcoin

From the data, if the number of searches increases by more than 150%, it seems the price of bitcoin almost always increases; however, this happens in only 10 samples out of 260. Likewise, if the number of searches decreases by more than 60%, bitcoin almost always drops.

Could this be statistically significant? Any comments on this?

FactChecker · Jan 5, 2021

No. The X variable explains very little of the variation in Y. An ##R^2## value of 0.013 indicates that there is very little statistical significance in the linear regression. (See https://statisticsbyjim.com/regression/interpret-r-squared-regression/ )

beamthegreat · Jan 5, 2021

FactChecker said:

No. The X variable explains very little of the variation in Y. An ##R^2## value of 0.013 indicates that there is very little statistical significance in the linear regression. (See https://statisticsbyjim.com/regression/interpret-r-squared-regression/ )

I understand that the data as a whole has little to no correlation, but my hypothesis is that during extreme events, X might be predictive of Y. Is there any way to prove/disprove this hypothesis?

beamthegreat · Jan 5, 2021

I've plotted the distribution of X and found that it follows a log-normal distribution. My idea is to exclude values n standard deviations away from the mean and plot the graph of n vs the R-squared value. If the R-squared value increases as I increase n, will this indicate that my hypothesis is correct?
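A minimal simulation sketch of why a rising R-squared may not settle the question. It assumes the intent is to keep only the points lying more than n standard deviations from the mean, and it uses synthetic, normally distributed data (not the thread's data, and not log-normal) in which the slope is the same weak value everywhere, so there is no special "extreme event" effect by construction. The function names are illustrative.

```python
import random

def r_squared(xs, ys):
    """Squared Pearson correlation, which equals R^2 for a one-variable linear fit."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)

def r2_keeping_extremes(xs, ys, n_sd):
    """R^2 using only points whose x is more than n_sd standard deviations from the mean."""
    mean = sum(xs) / len(xs)
    sd = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    kept = [(x, y) for x, y in zip(xs, ys) if abs(x - mean) > n_sd * sd]
    kx, ky = zip(*kept)
    return r_squared(kx, ky), len(kept)

# Synthetic data: one weak linear relation across the whole range of x.
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(20000)]
ys = [0.1 * x + rng.gauss(0.0, 1.0) for x in xs]

results = {n_sd: r2_keeping_extremes(xs, ys, n_sd) for n_sd in (0.0, 1.0, 2.0)}
for n_sd, (r2, kept) in results.items():
    print(f"keep |x - mean| > {n_sd} sd: {kept} points, R^2 = {r2:.3f}")
```

In this setup R-squared goes up as more of the middle is discarded, simply because the retained x values are more spread out. An increase of R-squared with n is therefore what a plain linear relation would also produce, and on its own it does not separate "X matters only in extreme events" from "X matters a little everywhere".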

Vanadium 50 · Jan 5, 2021

beamthegreat said:

Is this data statistically significant?

A good statistical rule of thumb: if you have to ask, the answer is "no".

My personal view is that if you a posteriori throw away 96% of your data, you don't have much there there.

beamthegreat · Jan 5, 2021

I went ahead and did it. Got a pretty decent correlation and an R-squared of 0.701.

beamthegreat · Jan 5, 2021

Vanadium 50 said:

A good statistical rule of thumb: if you have to ask, the answer is "no".

My personal view is that if you a posteriori throw away 96% of your data, you don't have much there there.

I've used all the data with my method but not sure if that's indicative of anything. Are there any other ways to prove/disprove this hypothesis?

beamthegreat · Jan 5, 2021

here's what it looks like

FactChecker · Jan 5, 2021

beamthegreat said:

I've used all the data with my method but not sure if that's indicative of anything. Are there any other ways to prove/disprove this hypothesis?

So your question is: if the low Y outliers tend to be associated with low X values and the high Y outliers tend to be associated with high X values, can statistical significance be established? Good question. It doesn't seem unreasonable to me. But I do not see that tendency in the data in your original post.

beamthegreat · Jan 5, 2021

FactChecker said:

So your question is: if the low Y outliers tend to be associated with low X values and the high Y outliers tend to be associated with high X values, can statistical significance be established? Good question. It doesn't seem unreasonable to me.

Exactly my question.

FactChecker said:

But I do not see that tendency in the data in your original post.

Is there a way to mathematically prove this? Here is what I see (but we all see patterns even from randomness):

Dale · Jan 5, 2021

What happens if you change your lower threshold to 50%, or your upper threshold to 100%, or both? What happens if you take the first half of your data, choose thresholds based on that, and then apply those thresholds in the last half?

beamthegreat · Jan 5, 2021

Dale said:

What happens if you change your lower threshold to 50%, or your upper threshold to 100%, or both? What happens if you take the first half of your data, choose thresholds based on that, and then apply those thresholds in the last half?

I've already tried incrementally increasing/decreasing both lower and upper threshold by 0.5 standard deviations starting from the mean, calculated the R-squared value, and plotted the results as seen in post #6 and #8.

BWV · Jan 5, 2021

This is finance, so you focus on the beta t-stat, not the ##R^2##. All the ##R^2## says is that the variable possibly explains a small percentage of the variation, and for a trading strategy this is fine. Although 1.3% is too low, no one gets a 90% ##R^2## for any nontrivial result in finance; 20% would be significant. Don't massage the data, just look to see if it's a significant predictor. Logically, there is no way that internet searches for this term would account for most of the variation in BTC prices; there are too many other potential factors.
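For a regression with a single predictor, the slope t-statistic can be recovered from ##R^2## and the sample size via ##|t| = \sqrt{R^2 (n-2) / (1 - R^2)}##. A small sketch follows, assuming the 0.013 quoted earlier comes from a one-variable fit on all n = 260 weekly points (the 260 is taken from the opening post and is an assumption here), and that the observations are independent. The p-value uses a normal approximation to the t distribution, which is only reasonable because n is large.

```python
from math import sqrt
from statistics import NormalDist

def slope_t_from_r2(r2, n):
    """Absolute t-statistic of the slope in a one-variable regression, from R^2 and n."""
    return sqrt(r2 * (n - 2) / (1.0 - r2))

def two_sided_p_normal(t):
    """Two-sided p-value from a normal approximation to the t distribution."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))

t = slope_t_from_r2(0.013, 260)  # R^2 from the thread; n = 260 is an assumption
p = two_sided_p_normal(t)
print(f"|t| = {t:.2f}, approximate two-sided p = {p:.3f}")
```

Under those assumptions |t| comes out at about 1.84, a little short of the conventional 1.96 cutoff. The sign of the slope cannot be recovered from ##R^2## alone, and serial dependence in the weekly data would make the real uncertainty larger than this suggests.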

Of course the data will look lognormal as it consists of percentage changes and cannot by definition be lower than -100%

Your interest, logically, is significant changes in internet search activity, and you would expect BTC price changes to be random around no change in search activity. A logistic regression that only triggers with a significant change in search might help, or simply using a probit where x = 1 if abs(search activity) > X%, zero otherwise. Be careful: you lose a lot of data the more you cut off x.

Dale · Jan 5, 2021

beamthegreat said:

I've already tried incrementally increasing/decreasing both lower and upper threshold by 0.5 standard deviations starting from the mean, calculated the R-squared value, and plotted the results as seen in post #6 and #8.

I see that now. It isn't very convincing in my opinion. You have to toss a lot of data before you get a decent regression.

A better approach would be to take the first half of the data, choose a threshold based on that, then see how it performs on the last half
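A minimal sketch of that split-sample check. The scoring rule (share of positive Y among points with X above the threshold), the candidate thresholds, and the toy numbers are all assumptions made for illustration; they are not the thread's data.

```python
def hit_rate(xs, ys, threshold):
    """Share of positive y among points with x above threshold, and the number of such points."""
    picked = [y for x, y in zip(xs, ys) if x > threshold]
    if not picked:
        return None, 0
    return sum(1 for y in picked if y > 0) / len(picked), len(picked)

def choose_threshold(xs, ys, candidates, min_count=5):
    """Candidate with the best in-sample hit rate; ties go to the earlier candidate."""
    best = None
    for thr in candidates:
        rate, count = hit_rate(xs, ys, thr)
        if count >= min_count and (best is None or rate > best[1]):
            best = (thr, rate, count)
    return best

def split_sample_check(xs, ys, candidates, min_count=5):
    """Choose a threshold on the first half, then score it on the second half."""
    half = len(xs) // 2
    best = choose_threshold(xs[:half], ys[:half], candidates, min_count)
    if best is None:
        return None
    thr, train_rate, _ = best
    test_rate, test_count = hit_rate(xs[half:], ys[half:], thr)
    return {"threshold": thr, "train_rate": train_rate,
            "test_rate": test_rate, "test_count": test_count}

# Toy data in time order: the rule looks perfect in the first half only.
toy_x = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
toy_y = [-1, -1, 1, 1, 1, 1, 1, -1, -1, 1, -1, 1]
report = split_sample_check(toy_x, toy_y, [0, 2, 4], min_count=2)
print(report)
```

The point of the design is that the threshold is fixed before the second half is looked at, so a large gap between the two hit rates is a warning that the threshold was fitted to noise.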

BWV · Jan 5, 2021

Also, reversing the variables would be interesting: does last week's BTC performance correlate with internet search activity this week?

Adding an autocorrelation variable (i.e. last week's BTC return) might be interesting as well.
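A short sketch of how the three lead-lag comparisons could be lined up. The weekly levels below are hypothetical, and plain correlations stand in for a joint regression with both predictors, so this only shows the alignment of "this week" against "next week".

```python
def pct_change(levels):
    """Week-over-week percent change of a series of levels."""
    return [(b - a) / a * 100.0 for a, b in zip(levels, levels[1:])]

def lead_lag_pairs(predictor, target):
    """Pair each week's predictor change with the following week's target change."""
    return list(zip(predictor[:-1], target[1:]))

def pearson_r(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical weekly levels, for illustration only.
search = [50, 55, 80, 60, 90, 70, 75]
price = [100, 104, 103, 120, 115, 130, 128]

d_search = pct_change(search)
d_price = pct_change(price)

r_search_leads = pearson_r(lead_lag_pairs(d_search, d_price))  # searches vs next week's price
r_price_leads = pearson_r(lead_lag_pairs(d_price, d_search))   # price vs next week's searches
r_price_auto = pearson_r(lead_lag_pairs(d_price, d_price))     # price vs next week's price
print(r_search_leads, r_price_leads, r_price_auto)
```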

beamthegreat · Jan 5, 2021

BWV said:

Your interest, logically, is significant changes in internet search activity and you would expect BTC price changes to be random around no change in search activity. A logistic regression that only triggers with a significant change in search might help, or simply using a probit where, x=1 if abs(search activity) >X%, zero otherwise. Be careful that you lose a lot of data the more you cut off x

BWV said:

Also, reversing the variables would be interesting: does last week's BTC performance correlate with internet search activity this week?

Thanks, both comments are very insightful. It is 4AM here, I will do both of your suggestions tomorrow.

madness · Jan 5, 2021

You're in danger of double dipping https://en.m.wikipedia.org/wiki/Circular_analysis

FactChecker · Jan 5, 2021

One problem is that you should evaluate the data at the ends of the range while keeping in mind the random variation in the middle unless you have a legitimate reason to think that the variation changes. In that context, the difference between the lower and upper X-range data is small compared to the random variation of the whole set of data. So there is no statistical significance.
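One way to put a number on that comparison is a permutation test, sketched below; this is a swapped-in technique, not something used in the thread. It measures the gap in mean Y between the k highest-X and k lowest-X points, then asks how often a gap that large appears when Y is shuffled across the whole data set, so the random variation of all the data sets the yardstick. The function names are illustrative, and k is assumed to be fixed before looking at the data.

```python
import random

def tail_gap(xs, ys, k):
    """Mean y of the k highest-x points minus mean y of the k lowest-x points."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    low = [ys[i] for i in order[:k]]
    high = [ys[i] for i in order[-k:]]
    return sum(high) / k - sum(low) / k

def permutation_p_value(xs, ys, k, n_perm=2000, seed=0):
    """Share of shuffles of y whose tail gap is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = tail_gap(xs, ys, k)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if tail_gap(xs, shuffled, k) >= observed:
            hits += 1
    # The +1 terms count the observed arrangement itself, so p is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Sanity check on an artificial, perfectly ordered example.
demo_x = list(range(100))
demo_y = list(range(100))
demo_gap = tail_gap(demo_x, demo_y, 5)
demo_p = permutation_p_value(demo_x, demo_y, 5)
print(demo_gap, demo_p)
```

If k or the thresholds are chosen after inspecting the scatter plot, the resulting p-value is too optimistic, which is the same a posteriori problem raised earlier in the thread.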

beamthegreat · Jan 6, 2021

BWV said:

A logistic regression that only triggers with a significant change in search might help, or simply using a probit where x = 1 if abs(search activity) > X%, zero otherwise. Be careful: you lose a lot of data the more you cut off x.

Got a pretty decent linear correlation using your suggestion. Any thoughts on this?

beamthegreat · Jan 6, 2021

BWV said:

Also, reversing the variables would be interesting: does last week's BTC performance correlate with internet search activity this week?

Adding an autocorrelation variable (i.e. last week's BTC return) might be interesting as well.

Seems to have zero correlation to me. X-axis = 1W bitcoin price change, Y-axis = future 1W search term change

BWV · Jan 6, 2021

Thinking about this some more: there is no reason to suppose the BTC % reaction to searches is symmetric or proportional, so I think a logistic or probit on increases in searches vs price change would be a realistic trading signal to test. Just looking at the data, there appears to be a high frequency of positive returns when there is an increase in search activity greater than around 10%. The standard procedure, as others have pointed out, would be to then use this data as a training set and test the result on a future set of data. The other question is whether this actually generated higher returns than just being purely long BTC, as the time period contains an astronomical rate of return.
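With a yes/no signal (search increase above a threshold) and a yes/no outcome (positive return the following week), a logistic or probit fit reduces to comparing two proportions, so a 2x2 table is a simple stand-in. The sketch below uses a one-sided Fisher exact test instead of a probit; that substitution, the function names, and the 10% threshold used as an input are all illustrative choices.

```python
from math import comb

def two_by_two(xs, ys, x_threshold):
    """Counts: (signal & up, signal & not up, no signal & up, no signal & not up)."""
    a = b = c = d = 0
    for x, y in zip(xs, ys):
        signal = x > x_threshold
        up = y > 0
        if signal and up:
            a += 1
        elif signal:
            b += 1
        elif up:
            c += 1
        else:
            d += 1
    return a, b, c, d

def fisher_one_sided(a, b, c, d):
    """P(top-left count >= a) under the hypergeometric null with all margins fixed."""
    row1 = a + b          # weeks with the signal
    col1 = a + c          # weeks with a positive return
    n = a + b + c + d
    total = comb(n, row1)
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, min(row1, col1) + 1))
    return tail / total

# Tiny hand-checkable example: a perfectly separated table of six weeks.
print(fisher_one_sided(3, 0, 0, 3))
```

Because the table compares the signal weeks against the base rate of up weeks in the same period, it partly addresses the concern about the strong overall uptrend, although it says nothing about the size of the returns.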

If you look at past bubbles, what would a regression of, say, internet searches for house flipping have told you about changes in house prices over the boom years of 2005-2006 vs the bust years of 2007-2009? Or internet stocks before or after they imploded in mid-2000?

madness · Jan 6, 2021

BWV said:

The standard procedure, as others have pointed out, would be to then use this data as a training set and test the result on a future set of data.

Or failing that, just go for cross-validation on the present dataset?
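A minimal k-fold sketch for a one-variable linear fit, comparing its out-of-fold error with a baseline that always predicts the training mean. Folds are contiguous blocks because the weekly data are time-ordered; that choice, the fold count, and the function names are assumptions, and adjacent weeks can still leak information across fold boundaries.

```python
def fit_line(xs, ys):
    """Ordinary least squares slope and intercept for one predictor."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def k_fold_mse(xs, ys, k=5):
    """Out-of-fold mean squared error of the fitted line and of a mean-only baseline."""
    n = len(xs)
    model_sq = 0.0
    baseline_sq = 0.0
    for fold in range(k):
        lo, hi = fold * n // k, (fold + 1) * n // k
        train_x = xs[:lo] + xs[hi:]
        train_y = ys[:lo] + ys[hi:]
        slope, intercept = fit_line(train_x, train_y)
        train_mean = sum(train_y) / len(train_y)
        for x, y in zip(xs[lo:hi], ys[lo:hi]):
            model_sq += (y - (slope * x + intercept)) ** 2
            baseline_sq += (y - train_mean) ** 2
    return model_sq / n, baseline_sq / n

# Sanity check on exactly linear data: the line should beat the baseline easily.
line_x = [float(i) for i in range(20)]
line_y = [2.0 * x + 1.0 for x in line_x]
model_mse, baseline_mse = k_fold_mse(line_x, line_y, k=5)
print(model_mse, baseline_mse)
```

If the search variable carries real predictive information, the model error should come out clearly below the baseline error; if the two are about equal, the fit is not adding anything out of sample.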

beamthegreat · Jan 6, 2021

BWV said:

If you look at past bubbles, what would a regression of, say, internet searches for house flipping have told you about changes in house prices over the boom years of 2005-2006 vs the bust years of 2007-2009? Or internet stocks before or after they imploded in mid-2000?

No idea, but I will investigate this.

BWV said:

Thinking about this some more: there is no reason to suppose the BTC % reaction to searches is symmetric or proportional, so I think a logistic or probit on increases in searches vs price change would be a realistic trading signal to test. Just looking at the data, there appears to be a high frequency of positive returns when there is an increase in search activity greater than around 10%. The standard procedure, as others have pointed out, would be to then use this data as a training set and test the result on a future set of data. The other question is whether this actually generated higher returns than just being purely long BTC, as the time period contains an astronomical rate of return.

My initial idea was not to find a way to outperform a portfolio that is long bitcoin, but rather to confirm whether changes in search are predictive of future prices. If correlation can be proven, then we can discuss all sorts of statistical arbitrages and trading strategies.
