# Regression Question help

• AndreTheGiant
I don't think I have ever seen a notation like this before. Can someone please help me understand it? In summary, the problem asks for the design matrix of a regression model in which k observations of Y are recorded when T = n. The first part asks for Y to be written in terms of dummy variables, as in B1X1 + ... + BnXn, while the follow-up asks whether the model should instead be written as b0 + b1x1 + b2x2 + ... + bnxn in terms of the dummy variables.

#### AndreTheGiant

Hi there.

I am having some trouble understanding the full context of this question.

Suppose we have a categorical variable T E (1...n) and we observe k observations for Y when T = n. If a regression model holds:

i) Write down Y in terms of dummy variables X1...Xi

ii) What is the design matrix X

iii) what is b?

So what does it mean when it says there are k observations when T = n? A categorical variable is sort of like a dummy variable right? As in using numbers to represent qualitative measurements such as hair colour etc. So does T only take one value from 1 to n or multiple values?

So if T = n, Y would be represented by B1X1 + ... + BnXn in terms of dummy variables?

How can there be only one b?

Thanks.

AndreTheGiant said:
Hi there.

I am having some trouble understanding the full context of this question.

I am also. Did this come from a textbook or course? If so, what was the subject of the chapter?

Suppose we have a categorical variable T E (1...n)

Did you mean to write "$T \in \{1, 2, \dots, n\}$"?

and we observe k observations for Y when T = n.

If you quoted that phrase accurately, I agree it is unclear. Could it have said
"we observe $k_n$ observations for $Y$ when $T = n$"?

If a regression model holds:

i) Write down Y in terms of dummy variables X1...Xi

Dummy variables for categorical data are often defined as variables that take only the value 0 or 1. So perhaps you are being asked to encode the category as a vector $(X_1, X_2, \dots, X_n)$ where $X_c = 1$ when $T = c$ and the rest of the $X_i$ are zero.
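As a rough illustration of that 0/1 encoding, here is a minimal Python sketch. The function name `one_hot` is made up for this example and does not come from the problem or any course material.

```python
import numpy as np

def one_hot(t, n):
    """Return the dummy vector (X_1, ..., X_n) for a category t in {1, ..., n}."""
    if not 1 <= t <= n:
        raise ValueError("t must lie between 1 and n")
    x = np.zeros(n, dtype=int)
    x[t - 1] = 1  # X_c = 1 exactly when T = c (categories are 1-based)
    return x

print(one_hot(2, 3))  # dummy vector for T = 2 out of n = 3 categories
```

Whether your course uses all n dummy variables or only n-1 of them is something to check in your notes.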

What do your text materials say about doing regression with such variables?

The first one is correct; that is what I meant. The second one is also correct; I forgot to put the subscript n on the k.

As for the first part, I also made a mistake there. It asks to write down E(Y|T), not Y, in terms of the dummy variables X1...Xi. I am not sure if that is the right way to approach it, because that is what I thought as well, but wouldn't that matrix just be the design matrix that the second part asks for? I thought I would be writing it like E(Y|T) = b0 + b1x1 + b2x2 + ... + bnxn?

AndreTheGiant said:
It is a homework question
You didn't say what the course was. Are you studying ANCOVA?

I can't give you much help on ANCOVA because I haven't looked at such material since the 1980s, and I think I only took one course that would have included it. (Of course, I can look things up on the web and refresh or educate my mind, but if you are taking a course in this material, you should have been instructed in its fundamentals.)

So what does it mean when it says there are k observations when T = n?
We've established that there are $k_n$ observations when $T = n$.
So my guess is that an example of the data is something like this:

$T = 1, k_1 = 4,$ observed Y values: (2.4, 3.2, 9.8, 3.2)
$T = 2, k_2 = 3,$ observed Y values: (3.3, 3.2, 1.7)
$T = 3, k_3 = 6,$ observed Y values: (2.6, 3.3, 2.0, 3.4, 2.2, 9.0)

A categorical variable is sort of like a dummy variable right?
I think they are different concepts. Don't your materials define them? Are you trying to work the problem without referring to your book or lecture notes?

As in using numbers to represent qualitative measurements such as hair colour etc. So does T only take one value from 1 to n or multiple values?

The problem says T (which is a variable) takes values only from 1 to n, but in each different situation it may take a different one of those values.

Sometimes dummy variables X1,X2,...are only allowed to take the values of 0 or 1.
So when T = 2, we have X1 = 0, X2 = 1, X3 = 0. You'll have to see what your course materials say about this.
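To make this concrete, here is a hedged Python sketch that stacks those 0/1 dummy vectors, one row per observation, into a matrix X. It uses the made-up Y values from my example data above and assumes a model with no intercept column, which may not be what your instructor intends.

```python
import numpy as np

# Illustrative Y values from the example in this thread, keyed by the category T.
groups = {
    1: [2.4, 3.2, 9.8, 3.2],
    2: [3.3, 3.2, 1.7],
    3: [2.6, 3.3, 2.0, 3.4, 2.2, 9.0],
}
n = len(groups)

# One entry per observation: its category label and its Y value.
t = np.array([c for c, ys in groups.items() for _ in ys])
y = np.array([v for ys in groups.values() for v in ys])

# Row j of X is the dummy vector of observation j: X[j, c-1] = 1 when t[j] == c.
X = (t[:, None] == np.arange(1, n + 1)).astype(float)

# Least-squares fit of y = X b with no intercept column (an assumption).
b, *_ = np.linalg.lstsq(X, y, rcond=None)

print(X.sum(axis=0))  # number of observations in each category
print(b)              # one coefficient per category
```

With this coding each fitted coefficient is simply the sample mean of Y within its category, because every row of X has exactly one 1.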

This is a one-way ANOVA problem. ANCOVA isn't covered just yet. It's a regression analysis course. I have my notes, but I am still confused, and I can't find anything on the web that explains it in the same notation or the same way my instructor does. I'm really trying to understand it.

You could post questions about statements made by your lecturer using his own notation if you don't understand the lectures.

Searching the web for a few minutes, I found this link:

http://www.biomedware.com/files/documentation/spacestat/Statistics/Regression/Categorical_Data_in_Regression_Analyses.htm [Broken]

The section "Reference cell vs. Effect cell parameterization" suggests that your problem could expect you to encode the possible values of T by using n-1 dummy variables and to let the case when all dummy variables are 0 represent the "reference" case. It also explains the relationship between the coefficients $b_i$ in the regression and various means.
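Here is a minimal Python sketch of what a reference-cell coding with n-1 dummy variables might look like. It reuses the made-up Y values from my earlier example, and the choice of T = 1 as the reference category is an assumption for illustration only.

```python
import numpy as np

# Illustrative Y values from the example in this thread, keyed by the category T.
groups = {
    1: [2.4, 3.2, 9.8, 3.2],
    2: [3.3, 3.2, 1.7],
    3: [2.6, 3.3, 2.0, 3.4, 2.2, 9.0],
}
n = len(groups)
t = np.array([c for c, ys in groups.items() for _ in ys])
y = np.array([v for ys in groups.values() for v in ys])

# Intercept column plus dummies for T = 2, ..., n; T = 1 is the reference
# category (an assumption), so its rows have all dummies equal to 0.
dummies = (t[:, None] == np.arange(2, n + 1)).astype(float)
X = np.column_stack([np.ones(len(y)), dummies])

# Least-squares estimate of b = (b0, b1, ..., b_{n-1}).
b, *_ = np.linalg.lstsq(X, y, rcond=None)

means = np.array([np.mean(groups[c]) for c in sorted(groups)])
print(b)                                # fitted coefficients
print(means[0], means[1:] - means[0])   # reference mean, then differences from it
```

In this coding the intercept estimates the mean of the reference category, and each remaining coefficient estimates the difference between one category's mean and the reference mean.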

iii) what is b?

How does your lecturer use 'b' in other problems? Can 'b' be a matrix? Was there a subscript on 'b' that you left out?
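If 'b' is meant to be the whole vector of coefficients, which is only a guess about your lecturer's notation, then a single b holds every coefficient at once. A sketch using the 0/1 dummy variables $X_1, \dots, X_n$ and no intercept term (both assumptions):

$$
E(Y \mid T) = b_1 X_1 + b_2 X_2 + \dots + b_n X_n, \qquad b = (b_1, b_2, \dots, b_n)^{\mathsf{T}},
$$

so that $E(Y \mid T = c) = b_c$. Stacking one row of dummy values per observation gives the design matrix $X$, and the model for all observations together reads $E(Y) = Xb$.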
