## statistical significance

Hello everybody,

I'm a bit stuck on a statistical significance problem. I have the following data: the number of visitors for each of 2 web pages, and the number of conversions for each web page. (A conversion could be the number of visitors who completed the web form.) I would like to be able to say that there is X probability that the ratio of visitors to conversions will remain the same as the number of visitors increases. Is this possible? Intuitively it seems impossible, since I could imagine a scenario where the composition of visitors changes drastically due to a link from a popular site. In other words, I don't know how representative my sample is of the population of possible visitors.

I am currently using the chi-square method to determine the probability that my data is not randomly distributed, but would like to have a stronger significance measurement.
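For concreteness, here is the kind of calculation I'm running -- a Pearson chi-square test on the 2x2 table (converted / not converted, page A / page B). All counts below are hypothetical:

```python
import math

def chi_square_2x2(conv_a, total_a, conv_b, total_b):
    """Pearson chi-square test of independence on the 2x2 table
    (converted / not converted) x (page A / page B)."""
    observed = [[conv_a, total_a - conv_a],
                [conv_b, total_b - conv_b]]
    grand = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [conv_a + conv_b, grand - (conv_a + conv_b)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed[i][j] - expected) ** 2 / expected
    # A 2x2 table has 1 degree of freedom; chi-square(1) is the square of
    # a standard normal, so the p-value is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: page A converts 30 of 1000 visitors, page B 45 of 1000.
stat, p = chi_square_2x2(30, 1000, 45, 1000)
print(stat, p)
```

For larger tables the p-value would need the chi-square CDF itself (e.g. scipy.stats.chi2); the erfc shortcut only works for 1 degree of freedom.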

If anyone can point me in the right direction with a link or the name of an algorithm, I would appreciate it.

Thanks,

jessica

Thanks for your post, Enuma. It was exactly the type of thing I was looking for. I have spent some time researching these links, and have a few follow-up questions:

 The t-test is the most commonly used method to evaluate the differences in means between two groups.
-- reference from StatSoft

Isn't this the same thing that a Chi-Square measures? It seems to me that the main difference between the t-test and chi-square is that chi-square is a non-parametric test, meaning that it is applicable without making as many assumptions about the distribution. Can somebody confirm or deny this?

Now, assuming the above is true, I still haven't found a method that will allow me to say: there is X probability that the ratio of visitors to conversions will remain the same as the number of visitors increases. I have had a gut feeling all along that it isn't mathematically possible to make such a statement, but I would appreciate anyone who can convince me that I'm wrong.

Thanks,

jessica

Recognitions:
Homework Help


I agree with your statement that Chi-sq. is a non-parametric test. Like the t-test, Chi-sq. measures the degree of closeness between two distributions. The difference is that in a t-test the two distributions are parametrized by their means (or locations), whereas in a Chi-sq. test the distributions are represented by the number of items falling into each category.

http://www.statsoft.com/textbook/stbasic.html#spearson
http://en.wikipedia.org/wiki/Pearson...hi-square_test

My first interpretation of your 2nd question was in terms of a confidence interval, i.e. you'd like to say something like "with X probability, the expected visitor/conversion ratio is between a and b" or Prob(a < E[#V/#C] < b) = X%. I guess that interpretation is not entirely accurate, wouldn't you say? I guess you'd like to say something like "the bounds a and b are invariant with respect to the sample size, with a certain probability."

One idea is to calculate growth rates over time, then test that they are zero. More formally, you could estimate the regression equation #V/#C = A + B1 n [+ B2 n^2 + ... etc.], where n is the number of visitors, then test B1 = [B2 = ... =] 0 (which will be given by the regression F-statistic). [The square brackets indicate optional terms.] Caution: #V/#C may not have a normal distribution, and regression analysis (e.g. the Ordinary Least Squares package in canned software) assumes that the left-hand side variable is normally distributed. See this on non-normal error terms of a regression.
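A bare-bones sketch of the simplest version of this idea (one regressor, no higher powers; all numbers made up): regress the observed ratio on n and t-test the slope. With a single regressor, the regression F-statistic is just the square of this t-statistic.

```python
import math
import random

def slope_test(x, y):
    """Fit y = a + b*x by least squares; return (b, t) for testing b = 0."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    # residual variance on n - 2 degrees of freedom
    s2 = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    return b, b / math.sqrt(s2 / sxx)

# Hypothetical: conversion ratios observed at growing visitor counts,
# generated with a flat true rate of 4%, so the slope should test as zero.
random.seed(1)
visitors = [100 * (i + 1) for i in range(20)]
ratios = [0.04 + random.gauss(0, 0.005) for _ in visitors]
b, t = slope_test(visitors, ratios)
print(b, t)
```

Compare |t| against the critical value of a t distribution with n - 2 degrees of freedom; an insignificant slope is consistent with a ratio that does not drift as visitors increase.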

 the bounds a and b are invariant with respect to the sample size, with a certain probability.
Yes, exactly.

After googling for "statistical unbiasedness" and "statistical consistency", I suspect what I am trying to show is that a and b are statistically consistent with some probability. However, I wasn't able to find any pages that demonstrate how to calculate statistical consistency. Do you know of any pages that might help?

jessica
Recognitions:
Homework Help
Science Advisor

See, e.g., http://en.wikipedia.org/wiki/Consist...or#Consistency

For example, suppose a researcher is taking random samples from a normally distributed population and calculating the arithmetic average of each sample. That formula is a consistent estimator of the population mean: as the sample size approaches the population size (here, infinity), the sample average approaches the true mean. (Strictly, unbiasedness alone does not imply consistency -- the estimator's variance must also shrink to zero as the sample grows, which it does for the sample average.) Also see #3 here.

Under "mild" assumptions, the ordinary least squares (see also) estimators of the A and the B(k) coefficients in the regression equation y = A + B(1) n [+ ... + B(K) n^K], where k = 1, ..., K, are unbiased and consistent estimators of the true relationship between the LHS and the RHS.

BTW, a and b aren't estimators, they are just bounds (constants), so consistency doesn't apply to them. What you need to show is consistent is the estimator that falls between those bounds, i.e. the average #V/#C.
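A quick simulation makes the convergence behind consistency concrete (the 4% "true" rate is made up): the sample average of 0/1 conversion indicators settles ever closer to the true rate as n grows.

```python
import random

random.seed(0)
p_true = 0.04  # hypothetical true conversion rate

def sample_mean(n):
    """Average of n Bernoulli(p_true) conversion indicators."""
    return sum(random.random() < p_true for _ in range(n)) / n

estimates = {n: sample_mean(n) for n in (100, 10_000, 1_000_000)}
for n, m in estimates.items():
    print(n, m)  # estimates drift toward 0.04 as n grows
```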
After talking this problem through, someone else has suggested the following formula, which he says will allow me to calculate the sample size needed in order to say that my conversion rate will remain the same (with a certain confidence) as time increases. This method seems much simpler than showing that the estimator of the conversion rate is consistent. Can anyone verify that the following reasoning is correct?

N = {(t)^2 * (p)(q)} / (d)^2

(Sorry about the formatting, I'm not sure how to make the superscripts.)

where I am solving for N, the sample size:
t is the value for a selected alpha level of .005 in each tail = 2.57
(p)(q) is the estimate of variance = .25 (maximum possible proportion (.5) * {1 - maximum possible proportion (.5)})
d is the acceptable margin of error for the proportion being estimated = .01

which gives (2.57)^2 (.5)(.5) / (.01)^2 = 16,512.25.

It seems to me that the variance estimate was pretty much pulled out of thin air. And I don't know what the formula is called, so I wasn't able to find reference to this formula elsewhere. But hey, what do I know? I'm a programmer, not a statistician. Any help would be greatly appreciated.

jessica

Recognitions:
Homework Help
Solving the confidence limits with respect to the sample size is a thoughtful idea. (See especially.) You could say that "as long as my sample exceeds this minimum sample size, the C/V ratio will stay within a lower bound a and an upper bound b, with probability X." The variance is something that you can calculate as the average of squared deviations from the average C/V: for each individual visitor V is always 1 and C is either 0 or 1, so you will be averaging a bunch of (0 - your average)^2 terms and another bunch of (1 - your average)^2 terms. In your post you used the binomial variance formula, pq = p(1-p), where p is your average C/V; I think that would be the preferred method in this case.
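The equivalence between averaging those squared-deviation terms and the binomial formula pq can be checked directly on made-up 0/1 data:

```python
# Hypothetical conversion record: 40 converters out of 1000 visitors.
conversions = [1] * 40 + [0] * 960
n = len(conversions)
p = sum(conversions) / n
# Average of the (0 - p)^2 and (1 - p)^2 terms described above...
mean_sq_dev = sum((c - p) ** 2 for c in conversions) / n
# ...matches the binomial variance p * (1 - p).
print(p, mean_sq_dev, p * (1 - p))
```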

"PRO's": easy to calculate, defensible as "straight out of a textbook formula."

"CON": may not really address visitor heterogeneity over time. E.g., suppose "converters" are positively correlated with the cumulative number of visits (or time), or alternatively, suppose "converters" are negatively correlated with the cumulative visits (or time) -- either way, a large sample size will not help -- you will observe a systematic increase or alternatively a systematic decrease in conversions.

Quote by reilly
Still without worrying about autocorrelations, there's a simple way to do many cases; that is, estimate a regression equation: conversions against time and visitor number. If the coefficient of time is not significant, then there's no statistically important variation in conversion rates over time.
I guess one way to write this would be (Equation 1):

conversions = a + b1 time [+ b2 time^2 + ...] + c1 visitors [+ c2 visitors^2 + ...] + d1 time * visitors [+ d2 time * visitors^2 + d3 time^2 * visitors + d4 time^2 * visitors^2 + ...]

When the objective is to test the conversion rate, the equation is a little trickier, as the # of visitors is now in the denominator on the left-hand side:

conversions/visitors = a + b1 time [+ b2 time^2 + ...],

so now let's multiply both sides by visitors, so that we get (Equation 2):

conversions = a visitors + b1 time * visitors [+ b2 time^2 visitors + ...]

Notice that Eq. 2 is structurally different from Eq. 1: the "correct" model has no intercept term and does not include higher powers of the visitors variable on their own. The relevant test statistic is the t-test on "b1 * visitors" [or the joint F-test of the "bk * visitors" terms for k = 1, 2, ...]. Strictly speaking, this tests the total derivative with respect to the powers of time, NOT the partial derivative with respect to time -- but the two are equivalent when time's higher powers are not included and the relevant test is the t-test of "b1 * visitors". In all of these tests, the "visitors" term is conventionally set to the sample average of the number of visitors (e.g. per day).
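A numeric sanity check of Eq. 2 (made-up daily data, fit by the no-intercept normal equations for two regressors): when the conversion rate is exactly flat, the estimated a recovers the rate and the time coefficient b1 comes out at zero.

```python
def ols_no_intercept_2(X, y):
    """Least squares for y = beta1*x1 + beta2*x2 (no intercept),
    solved via the 2x2 normal equations X'X beta = X'y."""
    s11 = sum(r[0] * r[0] for r in X)
    s12 = sum(r[0] * r[1] for r in X)
    s22 = sum(r[1] * r[1] for r in X)
    t1 = sum(r[0] * yi for r, yi in zip(X, y))
    t2 = sum(r[1] * yi for r, yi in zip(X, y))
    det = s11 * s22 - s12 * s12
    return (s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det

# Hypothetical daily data with an exactly flat 4% conversion rate.
visitors = [120, 95, 130, 110, 150, 140, 100, 125]
days = list(range(1, 9))
conversions = [0.04 * v for v in visitors]
X = [[v, d * v] for v, d in zip(visitors, days)]  # Eq. 2 regressors
a_hat, b1_hat = ols_no_intercept_2(X, conversions)
print(a_hat, b1_hat)
```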

A further issue is whether to use cumulative conversions and cumulative visits, or C and V per unit time (e.g. per day). Using cumulatives may introduce a spurious correlation ("spurious regression") problem, in that two entirely unrelated series will appear correlated when both accumulate over time. One example is "the cumulative amount of rainfall in Seattle" and "the cumulative amount of garbage collected in New York City": although the two events are unrelated, they appear correlated because both totals only ever grow. For this reason, a "per unit time" measurement might be preferred.
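The accumulation effect is easy to demonstrate with two unrelated random series (the rainfall/garbage names are just the example above): day-by-day values are essentially uncorrelated, but the running totals correlate almost perfectly.

```python
import random
from itertools import accumulate

def corr(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

random.seed(3)
# Two completely unrelated daily series.
rain = [random.random() for _ in range(1000)]
garbage = [random.random() for _ in range(1000)]
cum_rain = list(accumulate(rain))
cum_garbage = list(accumulate(garbage))

print(corr(rain, garbage))          # near zero: the daily series are unrelated
print(corr(cum_rain, cum_garbage))  # near one: both totals trend upward
```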

Keep in mind that a linear regression can do anything that ANOVA does, and more -- although user preferences might differ. See this link for an "okay" primer, although it is not entirely accurate: predictor variables (aka the X variables or "the X matrix") in a linear regression do NOT have to be continuous. The trick is to figure out which binary (aka "indicator" or "dummy" or "zero/one") variables should be included as (or among) the predictors so that they exactly identify the categories being analyzed.
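The regression/ANOVA equivalence in the simplest case, a single 0/1 dummy, can be verified directly (all numbers made up): the fitted intercept equals the mean of the dummy-0 group and the fitted slope equals the difference in group means.

```python
# Dummy-variable regression vs. group means (ANOVA-style comparison).
page = [0, 0, 0, 0, 1, 1, 1, 1]  # dummy: 0 = page A, 1 = page B
rate = [0.03, 0.05, 0.04, 0.04, 0.06, 0.05, 0.07, 0.06]
n = len(page)
mx, my = sum(page) / n, sum(rate) / n
b = (sum((x - mx) * (y - my) for x, y in zip(page, rate))
     / sum((x - mx) ** 2 for x in page))
a = my - b * mx
mean_a = sum(y for x, y in zip(page, rate) if x == 0) / page.count(0)
mean_b = sum(y for x, y in zip(page, rate) if x == 1) / page.count(1)
print(a, b)                     # regression estimates
print(mean_a, mean_b - mean_a)  # the same numbers from group means
```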

Recognitions:
Homework Help
Quote by smudge
Can anyone verify that the following reasoning is correct?

N = {(t)^2 * (p)(q)} / (d)^2

(Sorry about the formatting, I'm not sure how to make the superscripts.)

where I am solving for N, the sample size:
t is the value for a selected alpha level of .005 in each tail = 2.57
(p)(q) is the estimate of variance = .25 (maximum possible proportion (.5) * {1 - maximum possible proportion (.5)})
d is the acceptable margin of error for the proportion being estimated = .01

which gives (2.57)^2 (.5)(.5) / (.01)^2 = 16,512.25.

It seems to me that the variance estimate was pretty much pulled out of thin air. And I don't know what the formula is called, so I wasn't able to find reference to this formula elsewhere.
jessica,

The formula is correct. See the Z-ratio formula here. If you solve Z = d/(sigma/sqrt(N)) for N, you get N = sigma^2 Z^2 / d^2. (Here d is the difference "X bar" - mu, where the Greek letter mu denotes the true population mean.) Now, sigma^2 is the variance, which for the Bernoulli distribution is pq = p(1-p) = "X bar" times (1 - "X bar"). However, the person you talked to must have been thinking "let us assume the worst that can happen in terms of the variance, i.e. suppose the variance has the largest possible value, and see what kind of N will be needed," and that must be why your formula assumes a variance of 0.25. If you replace it with your actual estimated variance, "X bar" times (1 - "X bar") (which is necessarily < 0.25 for a Bernoulli distribution), the formula will give a smaller N.
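Spelled out in code (a sketch: the 2.57 comes from the post above; the 4% plug-in conversion rate is made up). The worst-case value of the formula as written is 16,512.25, rounded up here to the next whole visitor:

```python
import math

def sample_size(z, p, d):
    """N = z^2 * p * (1 - p) / d^2, rounded up: the minimum sample size so
    the estimated proportion lands within +/- d of the truth at the
    confidence level implied by z."""
    return math.ceil(z ** 2 * p * (1 - p) / d ** 2)

print(sample_size(2.57, 0.5, 0.01))   # worst-case variance, p = 0.5
print(sample_size(2.57, 0.04, 0.01))  # with a 4% estimated conversion rate
```

As noted above, plugging in the estimated p instead of the worst case 0.5 shrinks the required N considerably.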

However, I am not sure whether you should be using a t distribution (the "t" in your formula) or a normal distribution (the "Z" in the "Z formula"). You should use a t distribution (a.k.a. Student's t) when the denominator contains an estimated standard deviation -- but in your case you are assuming a constant variance (0.25), and hence a constant s.d. (0.5). In effect, you are multiplying a normally distributed random variable d by the constant 1/(sigma/sqrt(N)), which produces another normally distributed random variable, Z (a.k.a. the standard normal variable). That points to a "standard normal table" rather than a "t table" when you look up the critical value (2.57 in your post).

This whole approach does not address visitor heterogeneity over time. If you are "okay" with the assumption that conversions are essentially random (i.e. visitors are homogeneous over time, and their preferences are time-invariant), then you might go with this formula.

I hope this is responsive and useful.