# Hypothesis testing technique

I read that a hypothesis test for proportions can be used to decide whether the number of levels of a design variable can be reduced. In the literature, the following test is typical:
$H_0: p_i = 0$
$H_a: p_i \neq 0$, where $p_i$ is the proportion of class i (level i) within the designated design variable.
For my example I have the values 546, 464, 321, 112, 123, each representing one level.
Can you please explain how to use the test, since I am not clear on it, and how to decide on confidence levels?

Stephen Tashi
I suggest you give the specifics of your problem. The phrase "levels of a design variable" doesn't reveal what mathematical models you are considering. (For example, are you assuming your data is from a mixture of normal distributions?) The concept of "confidence interval" is associated with the task of parameter estimation, not with the task of hypothesis testing. Instead of "confidence level", did you mean "type 1 error"?

Applying statistics to real world problems is highly subjective.

chiro
Hey DrunkenPhD.

I would suggest framing your problem as a regression problem and then giving us the model in terms of its assumptions, including the distribution and the domain/range of the values (integers, real numbers, categories, etc.).

If you do this, it will make everything much more specific and people can give you direct answers to a direct question (building on the advice of Stephen Tashi above).

Also note that regressions have well-defined estimator distributions and test statistics for a wide class of problems, and when your data set is big enough the central limit theorem tends to take over, which means normal distributions provide a good enough approximation.
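As a rough illustration of the normal approximation mentioned above, here is a minimal sketch that treats the five values from the opening post as counts of observations per level (an assumption, since the post does not say how they were obtained). It computes each sample proportion and an approximate 95% interval; the helper name `wald_interval` is made up for this example.

```python
import math

# Counts per level, taken from the opening post (assumed to be
# numbers of observations falling in each level).
counts = [546, 464, 321, 112, 123]
total = sum(counts)

Z_95 = 1.96  # approximate two-sided 95% quantile of the standard normal


def wald_interval(count, n, z=Z_95):
    """Return (p_hat, lower, upper) using the normal approximation."""
    p_hat = count / n
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)  # estimated standard error
    return p_hat, p_hat - z * se, p_hat + z * se


intervals = [wald_interval(c, total) for c in counts]

for level, (p_hat, lower, upper) in enumerate(intervals, start=1):
    print(f"level {level}: p_hat={p_hat:.4f}, 95% CI=({lower:.4f}, {upper:.4f})")
```

Note that any level with a nonzero count already contradicts a proportion of exactly 0, so in practice the question is usually whether a level's proportion is small enough to merge, which is a modelling judgment rather than something this sketch decides.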

chiro
This is an example of a generalized linear model and you are basically trying to estimate a proportion which is a value between 0 and 1. To understand this in detail you need to understand Generalized Linear Models and how they are constructed to get estimators and test statistics.

The normal linear model is a regression model in which the Y value is continuous along the real line. The idea is to build a mean response model, where you fit the mean of your Y value (Y is a random variable which is a function of other random variables) as a function of the independent X values. If the Y value is continuous then you can do this, and you can derive results regarding estimators and all the other statistical jazz.

Generalized linear models work by having what is called a link function. Instead of finding the mean response directly, you find the mean response in terms of some function of it. So while in linear models your g(x) = x (i.e. identity function), the generalized linear models allow g(x) to be an arbitrary function (with specific properties like being invertible and some others).
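In symbols, a sketch of the idea (the notation $\mu$, $\beta_j$ and $x_j$ is introduced here for illustration and is not taken from the thread), with $\mu = E[Y \mid x_1, \dots, x_k]$ the mean response:

$$g(\mu) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$

The normal linear model is the special case $g(\mu) = \mu$. For a proportion, a common choice is the logit, $g(\mu) = \log\frac{\mu}{1-\mu}$, whose inverse $\mu = \frac{1}{1 + e^{-\eta}}$ always lies strictly between 0 and 1, where $\eta$ denotes the right-hand side above.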

GLMs are used for proportions because the mean does not range over the whole real line; it is confined to the [0,1] section of the real line. A link function (it is the g(x) used in the paper you just linked to) is therefore needed so that the GLM theory can be applied. There is a whole theory behind using link functions to get the expectation and variance, so that you can construct test statistics and confidence intervals and get the same sorts of things you get in the non-GLM results.

There is no easy way to understand this properly without understanding what is going on in Generalized Linear Models. Any textbook on the subject will go through the results.

If you want the easy answer, just remember that normal linear models look at the mean response across the whole real line, while GLMs deal with estimating parameters in models that do not fit the normal class of linear models. The link function is constructed so that it provides the right transformation between the entire real line and whatever range is suitable for the parameter of interest.

In your example, the real line (-infinity,+infinity) is mapped to the interval (0,1) for the proportion parameter, and the two link functions most commonly used for this are the probit and the logit. Look them up for more information.
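For reference, here is a minimal sketch of the logit and probit links and their inverses, using only the Python standard library. The function names are chosen for this example; the link maps a proportion in (0,1) to the real line and the inverse link maps back.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def logit(p):
    """Logit link: maps a proportion in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """Inverse logit: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def probit(p):
    """Probit link: the standard normal quantile of p."""
    return _STD_NORMAL.inv_cdf(p)


def inv_probit(eta):
    """Inverse probit: the standard normal CDF."""
    return _STD_NORMAL.cdf(eta)


for p in (0.1, 0.5, 0.9):
    print(p, logit(p), probit(p))
```

Both links send 0.5 to 0 and are symmetric around it; they differ mainly in how quickly they grow as the proportion approaches 0 or 1.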

So I decided to study GLMs in detail. They seem very interesting and have a lot of applications. Can you please suggest any prerequisites or further reading for this?

chiro
The prerequisite for Generalized Linear Models (GLM) is the appropriate linear models theory.

As mentioned above, GLMs generalize the population mean response model in terms of a link function, but the whole linear model framework is based on regression modeling, with a theory of estimators built on matrix theory and maximum likelihood estimation.

For the linear model with normally distributed errors, it can be proven in general that the least-squares estimator is the Maximum Likelihood Estimator (MLE) of the parameters. Since maximum likelihood is a standard, well-studied, and well-known technique with a lot of useful properties, it is very useful for the study of linear models and their relation to statistics.

You can get a better idea of estimators in general by getting a book on Statistical Inference. The graduate books look at the topic in an abstract way and use decision theory to analyze estimators but that is a lot more general. All of the MLE stuff is covered in an undergraduate text on the subject with some more detail covered in the graduate approach.

For normal Linear Models, brush up on your matrix theory and get a book that does a general matrix approach to the proofs. Note that the matrix proof will involve you understanding how to do derivatives of a "matrix" so that the MLE (which uses derivatives) can be done on arbitrary matrices (and this is proved for general classes of models).

Once you understand all of the linear models theory (deriving the MLE, knowing what MLEs are and their other properties for estimation in general), move on to the GLM theory, which looks at estimation in the case where you have a link function and other kinds of constraints. The GLM stuff requires you to solve the estimating equations iteratively, instead of in closed form as in linear models.
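To make the "iterative instead of closed form" point concrete, here is a minimal sketch of Newton-Raphson for a logistic regression with an intercept and one slope, written in plain Python so the matrix steps are visible. For the logit link this coincides with Fisher scoring. The data and the function name `fit_logistic` are hypothetical and chosen only for illustration.

```python
import math


def fit_logistic(x, y, max_iter=25, tol=1e-10):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Score vector: derivative of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix X'WX with weights w = p(1 - p).
        w = [pi * (1.0 - pi) for pi in p]
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        # Solve the 2x2 system (information) * step = score.
        det = i00 * i11 - i01 * i01
        d0 = (i11 * g0 - i01 * g1) / det
        d1 = (i00 * g1 - i01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Hypothetical data: the 0s and 1s overlap in x, so a finite MLE exists.
x_data = [0, 1, 2, 3, 4, 5, 6, 7]
y_data = [0, 0, 1, 0, 1, 1, 0, 1]
b0_hat, b1_hat = fit_logistic(x_data, y_data)
print(b0_hat, b1_hat)
```

Each pass solves a weighted least-squares-like system with weights that depend on the current estimate, which is why no single closed-form solution exists as it does for the normal linear model.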

A good way to understand how GLMs are fitted is to study iterative algorithms for maximizing a likelihood. The standard fitting method for GLMs is iteratively reweighted least squares (Fisher scoring), and the Expectation Maximization algorithm (or EM as it is known) is a related iterative technique for maximum likelihood. Understanding these will help you follow the proofs and see what is going on when finding estimators for parameters in GLM-type models.

The theory of GLM also uses the exponential family and understanding that in the context of the above will help you understand how all the theory for actually getting the parameters is done.

When I did this stuff (both in undergraduate and graduate) I mostly used notes delivered in lectures. I'm sure there are lots of books that teach this stuff and have the same kind of content, but if you need books, then look at the comments above and get books on statistical inference and linear models, and review basic matrix properties and matrix calculus (derivatives with respect to matrices) before looking into the whole GLM framework.

WWGD
Just curious: what statistic or method do we use to determine the explanatory power of a given level? For example, if we have, say, 5 levels for vehicle type, L1, ..., L5, how do we test the explanatory power of a level Li to decide whether we can get rid of it and still have a good model? Is it the Wald statistic W? I suspect we compare the likelihood with and without Li (this is the statistic that appears at the top of p. 732 in http://www.fc.up.pt/dmat/engmat/2012...20accident.pdf).
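One common way to make the "with and without" comparison is a likelihood-ratio test between nested models. The sketch below is only an illustration under stated assumptions: a binary outcome, two levels that are candidates for merging, and made-up event counts; the function names are hypothetical. The full model gives each level its own proportion, the reduced model gives both the same one, and twice the log-likelihood difference is compared with a chi-square distribution on 1 degree of freedom.

```python
import math


def binom_loglik(events, trials, p):
    """Binomial log-likelihood up to a constant (requires 0 < p < 1)."""
    return events * math.log(p) + (trials - events) * math.log(1.0 - p)


def lr_test_merge(k1, n1, k2, n2):
    """Likelihood-ratio test of 'levels 1 and 2 share one proportion'.

    Assumes 0 < k < n for each level so that all logarithms are finite.
    """
    p1, p2 = k1 / n1, k2 / n2
    p_pooled = (k1 + k2) / (n1 + n2)
    ll_full = binom_loglik(k1, n1, p1) + binom_loglik(k2, n2, p2)
    ll_reduced = (binom_loglik(k1, n1, p_pooled)
                  + binom_loglik(k2, n2, p_pooled))
    # Clamp at 0 to guard against tiny negative rounding errors.
    stat = max(2.0 * (ll_full - ll_reduced), 0.0)
    # Upper tail of chi-square with 1 df: P(X > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical event counts for two levels (not data from the thread).
stat, p_value = lr_test_merge(30, 112, 35, 123)
print(f"LR statistic = {stat:.4f}, p-value = {p_value:.4f}")
```

A large p-value means the data give little evidence that the two levels differ, which is one argument for merging them. A Wald test on the corresponding coefficient addresses the same question and usually agrees in large samples.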

chiro