Linear regression with discrete independent variable

In summary: The mean is preferred, though the median is also acceptable. If the data sets have different amounts of data, the means should be weighted by their respective sample sizes.
  • #1
CopyOfA
Hey, I have a problem where I have a discrete independent variable (integers spanning 1 through 27) and a continuous dependent variable (50 data points for each independent variable). I am wondering about the best method of regression here. Should I just fit to the mean or median? Is there a way to quantify the fit that takes all the data points into account? Thanks!
 
  • #2
A regular regression should work if these integers are not arbitrary.
 
  • #3
mfb said:
A regular regression should work if these integers are not arbitrary.
Agree. The integer values should be the measurement of something: 27 should mean 27 times more of something than 1 does, because that is how the regression will interpret the integer values. If the integers are not like that, there is still a way to deal with them if they just indicate different categories. You would want to indicate each category by a separate {0,1} dummy variable. Only integers that are a measurement of something should keep values indicating the amount of the thing they represent.
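(Not part of the original thread.) A minimal sketch of the {0,1} dummy-variable encoding described above, assuming numpy; the category codes here are made up for illustration:

```python
import numpy as np

# Hypothetical example: integer codes 1..3 that are mere category labels.
codes = np.array([1, 2, 3, 2, 1, 3])

# One-hot (dummy) encoding: one {0,1} column per category, so the
# regression does not treat code "3" as three times code "1".
categories = np.unique(codes)            # [1, 2, 3]
dummies = (codes[:, None] == categories[None, :]).astype(float)
print(dummies)                           # one row per observation
```

Each row has exactly one 1, marking that observation's category; these columns would replace the raw integer in the design matrix.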
 
  • #4
So just to be clear on the problem. Think of the independent variable as time and I have 27 time stamps. At each time stamp I have 50 data points for the dependent variable (i.e., at t=1, there are 50 data points that are continuous, same at t=2,...,27). This can be imagined as a sequential box plot. My question is: What is the best method to fit a line through all my data? And how should I quantify the goodness of fit?

My thinking was to fit through the medians of the data at each time stamp. But perhaps it should be mean... I’m not sure. Further I don’t know how I can communicate a goodness of fit. Thanks again for the help.
 
  • #5
CopyOfA said:
So just to be clear on the problem. Think of the independent variable as time and I have 27 time stamps.
If the times ##t_i## associated with the 27 time stamps are in even, consecutive steps, with ##t_1 < t_2 < \dots < t_{27}##, then you can use the integer index directly. If the times are not a linear function of the index, then you can set the earliest time ##t_1 = 0##, subtract that time from all the others to calculate ##t_2, \dots, t_{27}##, and use the ##t_i## as the independent variable rather than their index.
 
  • #6
Unfortunately, I don't think I'm being clear, and admittedly my title does not convey the actual problem.

My issue isn't with regressing on discrete variables per se, but with the combination of a discrete independent variable and continuous dependent variables. Typically in regression one has a set of dependent data points in one-to-one correspondence with the independent variable, ##\{x_i, y_i\}##. In multivariate regression, you have a vector of independent variables corresponding to a single dependent variable, ##\{\mathbf{x}_i, y_i\}##. However, in my problem, I have a single independent variable that corresponds to a vector of dependent variables, ##\{x_i, \mathbf{y}_i\}##. As an example of the data, consider the attached figure with (standard) normally distributed data plus a linear offset.

[Attached figure: 50 normally distributed points at each integer x, offset by a linear trend]

So, what is the best way to perform a linear regression through this data, and what is the best way to evaluate said linear fit? Should I simply take the mean or median of the data and then perform linear regression between those two variables? Is there a way to achieve a robust fit through all the data, minimizing the residuals over all the data?
 

  • #7
As long as the independent index has some linear meaning, that does not present a problem for the regression algorithm. It will give you an estimator of the dependent variable based on the independent variable. You can plug in any continuous value between 1 and 27 and get an estimate.

What you want to avoid is an independent variable that is just a label for a set of data with no real association with time. Like 1=>time 10, 2=> time 0.5, 3=> time 12, etc. That would be bad. As long as the index linearly reflects time, it will be ok. Like 1=>time 10, 2=>time 12.5, 3=>time 15.0, 4=>time 17.5, etc.

If you have all the data, it is better to use that. Otherwise, the mean would be ok if all the sets have the same number of data points. If the sets have different amounts of data, the means should be weighted by the different amounts of data. You might not get very meaningful results from using a median or mode.
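(Not part of the original thread.) A sketch, with synthetic data and numpy assumed, of the claim above: an ordinary least-squares fit on every raw point coincides with a fit through the group means when each mean is weighted by its group's sample size:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical groups with unequal sizes: each x value has many y observations.
groups = {1: rng.normal(2.0, 1.0, 30),
          2: rng.normal(4.0, 1.0, 50),
          3: rng.normal(6.0, 1.0, 20)}

# Fit 1: ordinary least squares on every raw data point.
x_all = np.concatenate([[x] * len(ys) for x, ys in groups.items()]).astype(float)
y_all = np.concatenate(list(groups.values()))
A = np.column_stack([np.ones_like(x_all), x_all])
coef_raw, *_ = np.linalg.lstsq(A, y_all, rcond=None)

# Fit 2: weighted least squares on the group means, each mean
# weighted by its group's sample size (via sqrt-weight scaling).
xs = np.array(list(groups), dtype=float)
means = np.array([ys.mean() for ys in groups.values()])
w = np.array([len(ys) for ys in groups.values()], dtype=float)
Am = np.column_stack([np.ones_like(xs), xs])
coef_wls, *_ = np.linalg.lstsq(Am * np.sqrt(w)[:, None],
                               means * np.sqrt(w), rcond=None)

print(np.allclose(coef_raw, coef_wls))   # the two fits coincide
```

This works because, within a group, x is constant, so the group's contribution to the normal equations collapses to its size times its mean.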
 
  • #8
Yes, this is absolutely the case with my data. The time stamps are in order such that ##t_1 < t_2 < \dots < t_n##.

Are there better ways of doing regression than simply regressing on the mean or median of the data at each time stamp? Is the mean or median preferred? How should I evaluate the fit? I'd prefer some way of doing this so as to minimize the residuals to all the data, not just the mean or median.
 
  • #9
CopyOfA said:
Yes, this is absolutely the case with my data. The time stamps are in order such that ##t_1 < t_2 < \dots < t_n##.

Are there better ways of doing regression than simply regressing on the mean or median of the data at each time stamp? Is the mean or median preferred? How should I evaluate the fit? I'd prefer some way of doing this so as to minimize the residuals to all the data, not just the mean or median.
Sorry. I was editing post #7 while you responded. See my answer to this in the last paragraph of #7. Use the raw data if you have it.
 
  • #10
FactChecker said:
If you have all the data, it is better to use that.

I do have all the data. In what way should I use all the data? Is there some bootstrapping method, or should I perform multiple regressions over randomly chosen subsets of the data?
 
  • #11
There is not really anything special to do. Just perform a simple linear regression on all the data.
 
  • #12
OK, but I'm not really even sure how to do this... Consider linear regression on two variables.
$$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}= \mathbf{X}A = \begin{bmatrix} 1 & x_1\\1 & x_2\\ \vdots & \vdots \\ 1 & x_n\end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix}$$

In this case, the slope and intercept can be found simply by ##\mathbf{X}^{-1}\mathbf{y}## (using pseudo inverse). However, in my case the target matrix is:
$$\begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1m} \\ y_{21} & y_{22} & \cdots & y_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{nm} \end{bmatrix}$$

If I performed the same inversion of ##\mathbf{X}##, then this will produce ##m## values for ##a## and ##b##, and presumably each column of ##\mathbf{y}## was regressed separately. This is undesirable since assuming each column of ##\mathbf{y}## is an independent line is not sensible for my data. What I would like is a robust value of ##a## and ##b## that tries to minimize the residuals over all the data. I hope this makes sense.
 
  • #13
CopyOfA said:
OK, but I'm not really even sure how to do this... Consider linear regression on two variables.
$$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}= \mathbf{X}A = \begin{bmatrix} 1 & x_1\\1 & x_2\\ \vdots & \vdots \\ 1 & x_n\end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix}$$

In this case, the slope and intercept can be found simply by ##\mathbf{X}^{-1}\mathbf{y}## (using pseudo inverse).

If you want to think clearly, you need to write the math clearly.

I would write it as:

$$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}= \mathbf{Xa} = \begin{bmatrix} 1 & x_1\\1 & x_2\\ \vdots & \vdots \\ 1 & x_n\end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix}$$

Note that you use a bold lower-case ##\mathbf y## for the vector on the LHS, so the RHS should use the same for the vector ##\mathbf a##, not ##A##.

and the solution can be shown to be, via the normal equations:

##\mathbf X^T \mathbf{y} = \mathbf X^T \mathbf{Xa} \to \mathbf a = \big(\mathbf X^T \mathbf X\big)^{-1} \mathbf X^T \mathbf{y} ##

equivalently via 'thin' QR factorization
##\mathbf a = \big(\mathbf R\big)^{-1} \mathbf Q^T \mathbf{y} ##

Both of these are about minimizing the L2 norm of the difference between your estimator vector and your ##\mathbf y##.
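(Not part of the original thread.) A minimal numpy sketch, on made-up data, confirming that the normal-equations solution and the thin-QR solution above agree:

```python
import numpy as np

rng = np.random.default_rng(2)

# A tall design matrix [1, x] and noisy linear responses (synthetic).
x = rng.uniform(0.0, 10.0, 100)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, x.size)

# Normal equations: a = (X^T X)^{-1} X^T y
a_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Thin QR: a = R^{-1} Q^T y
Q, R = np.linalg.qr(X)                 # 'reduced' QR by default
a_qr = np.linalg.solve(R, Q.T @ y)

print(np.allclose(a_normal, a_qr))     # same least-squares solution
```

The QR route is generally preferred numerically because it avoids forming ##\mathbf X^T \mathbf X##, which squares the condition number.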
- - - -
It isn't clear what you're trying to estimate exactly, but consider that

##
\mathbf Y = \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1m} \\ y_{21} & y_{22} & \cdots & y_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{nm} \end{bmatrix}##

has ##mn## values, and you can stack them column by column via the vec operator.

##
vec\big(\mathbf Y\big)##

It appears you in effect have 2n data points but want to estimate ##mn## outputs -- assuming that ##m\gt 2##. In general people like to solve this via a minimal length solution.

- - - -
CopyOfA said:
What I would like is a robust value of ##a## and ##b##

Again, clear thinking: are you aware of the technical meaning of robust here? The least squares approach minimizes an L2 norm, but L1 norms correspond to robust estimation. It's not clear what robust ##a## and ##b## actually means -- maybe you meant stable ##a## and ##b##, not robust? They are different.
 
Last edited:
  • #14
You just have many data points that happen to have the same independent value. There is nothing special about that; you can treat them the same way as you would treat different independent values.
 
  • #15
Is there some reason that you are doing the math yourself? There are a lot of utilities to do simple linear regression. I suspect that you may be over-thinking this problem.

Don't worry about the data being clustered into sets for the same time. That does not matter to the regression algorithm. Just apply a regression algorithm to all the sets of (x,y) values, repeating the appropriate x value as often as needed. Having multiple values of y for the same value of x is absolutely normal. In fact, the assumption is that there is a normally distributed random variable added to each y value. So repeats of the same x value will never give the exact same y value. The regression algorithm was made to handle that situation.
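(Not part of the original thread.) A sketch of this advice on synthetic data shaped like the problem in the thread (27 time stamps, 50 observations each, numpy assumed), with ##R^2## computed over all points as one way to quantify the fit:

```python
import numpy as np

rng = np.random.default_rng(1)

# 27 time stamps, 50 observations each: x is simply repeated per observation.
t = np.repeat(np.arange(1, 28), 50)
y = 0.5 * t + rng.normal(0.0, 1.0, t.size)   # made-up linear trend + noise

# Ordinary least squares on all 27*50 points at once.
slope, intercept = np.polyfit(t, y, 1)

# Goodness of fit over ALL points: coefficient of determination R^2.
resid = y - (slope * t + intercept)
r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
print(slope, intercept, r2)
```

No grouping logic is needed: repeating the same x value 50 times is exactly how the regression "sees" multiple y observations at one time stamp.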
 
  • #16
StoneTemplePython said:
If you want to think clearly, you need to write the math clearly.
Fair enough.

What I'm trying to do is estimate a linear fit through all of my data. Fitting through the mean is appropriate for normally distributed data. That is, if we assume ##\mathbf{y}_i## (the ##i##th row of the matrix ##\mathbf{Y}##) is distributed ##N(\mu_y(x_i),\sigma_y)##, then we can just say that ##\mu_y(x_i) = a + bx_i##. Finding these fit coefficients would be done according to the process laid out earlier:
$$\mathbf{a} = \begin{bmatrix}a\\b\end{bmatrix}=\mathbf{R}^{-1}\mathbf{Q}^{T}\boldsymbol{\mu}_y$$
where ##\boldsymbol{\mu}_y = \begin{bmatrix}\mu_y(x_1) & \mu_y(x_2) & \cdots & \mu_y(x_n)\end{bmatrix}^T## and ##\mathbf{X}=\mathbf{QR}## is the ##QR## factorization of ##\mathbf{X}##. This would be the best-fit line through the means of the data, and I suspect it also minimizes the L2 norm over all the data, since we are assuming a symmetric distribution at each ##x_i## (though I've not attempted to work through the math).

In the case of the data that I am working with, I cannot assume a normal distribution, nor can I necessarily assume a symmetric distribution. So, I am a little hesitant to simply fit through the means or medians of the data. Furthermore, I cannot blindly extend the above outlined approach to my matrix of dependent variables. If I tried to do so, this would result in ##m## estimates for ##a## and ##m## estimates for ##b##. That is if
$$\mathbf{Y} = \begin{bmatrix}y_{11} & y_{12} & \cdots & y_{1m} \\ y_{21} & y_{22} & \cdots & y_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{nm}\end{bmatrix}$$
then
$$\mathbf{R}^{-1}\mathbf{Q}^{T}\mathbf{Y} = \begin{bmatrix} a_1 & a_2 & \cdots & a_m\\ b_1 & b_2 & \cdots & b_m\end{bmatrix}$$
This seems to be clearly undesirable because it raises the question: Which coefficient value should I use? Furthermore, it treats each column of the ##\mathbf{Y}## matrix as a single sequence, and each pair ##\left\{a_i, b_i\right\}## corresponds to the ##i##th column.

StoneTemplePython said:
It's not clear what robust ##a## and ##b## actually means -- maybe you meant stable, not robust, here? They are different
Agreed; I was not clear on what I wanted. I am hoping to get a few more options on how to linearly fit through the data. One obvious option is what I mentioned: Fit through the means and/or medians of each ##\mathbf{y}_i## (##i##th row of data matrix ##\mathbf{Y}##). Perhaps another option would be pulling random samples of each row, fitting through the means and/or median of each row subsample, then doing this over and over until I get a distribution on the linear fit coefficients. Are there other options? I would like something that I can defend whether through L1 or L2 norms (if this is even possible).

Thanks again for all the help.
 
  • #17
CopyOfA said:
I would like something that I can defend whether through L1 or L2 norms (if this is even possible).
Linear regression minimizes the sum of squared errors, so it is very compatible with the L2 norm. My recommendation is to use the well-established tools or be prepared to defend your decision not to. If you do something other than simple linear regression on the entire data set, the first question anyone will ask is why you did something else.
 
  • #18
FactChecker said:
Linear regression minimizes the sum squared error, so it is very compatible with the L2 norm
Simple linear regression would minimize the L2 norm, if the data were normally distributed (or I suspect, symmetric). As I mentioned, if the data ##\mathbf{y}_i## at each ##x_i## were normally distributed (or perhaps just symmetric), then simple linear regression on the mean would be absolutely defensible according to the L2 norm. However, my data is not normally distributed, nor symmetric.
 
  • #19
CopyOfA said:
Simple linear regression would minimize the L2 norm, if the data were normally distributed (or I suspect, symmetric). As I mentioned, if the data ##\mathbf{y}_i## at each ##x_i## were normally distributed (or perhaps just symmetric), then simple linear regression on the mean would be absolutely defensible according to the L2 norm. However, my data is not normally distributed, nor symmetric.
Sorry, I missed that point. Are you hoping that the central limit theorem will give the sample mean of each time set an approximate normal distribution? In that case, as you suggested, you could do a simple linear regression using the mean y values for each set. That sounds like a reasonable approach.
 
  • #20
From what I can tell, the data ##\mathbf{y}_i## at each ##x_i## is not normally distributed. This is one reason that I'm hoping to find some other methods of regression beyond simply fitting through the means.
 
  • #21
CopyOfA said:
Are there other options? I would like something that I can defend whether through L1 or L2 norms (if this is even possible).

As much as I like probability, let's keep this simple:

you have an error term and some amount of data. You want to minimize in-sample error (computed with the help of some cost function) and hope it generalizes to out-of-sample. We need to know more on what you want out of this cost function.

High level: least squares as shown will minimize sum of squared errors. Linear Programming can be used to minimize sum of absolute value of errors. We could throw regularization parameters and a bunch of other stuff at this, but I mean you have only 2 columns (really one bias and one feature), so I wouldn't get carried away here.
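(Not part of the original thread.) A sketch of the linear-programming route mentioned above, minimizing the sum of absolute errors (least absolute deviations), assuming scipy is available; the data are made up, with heavy-tailed noise where L1 fitting tends to help:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)

# Synthetic data with a heavy-tailed error term.
x = np.repeat(np.arange(1.0, 11.0), 5)
y = 1.0 + 2.0 * x + rng.standard_t(df=2, size=x.size)

n = x.size
# Variables: [a, b, e_1..e_n]; minimize the sum of the e_i,
# where each e_i bounds the absolute residual at point i.
c = np.concatenate([[0.0, 0.0], np.ones(n)])

#  a + b*x_i - y_i <= e_i   and   y_i - a - b*x_i <= e_i
A1 = np.column_stack([np.ones(n), x, -np.eye(n)])
A2 = np.column_stack([-np.ones(n), -x, -np.eye(n)])
A_ub = np.vstack([A1, A2])
b_ub = np.concatenate([y, -y])

bounds = [(None, None), (None, None)] + [(0.0, None)] * n
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
a_l1, b_l1 = res.x[:2]
print(a_l1, b_l1)
```

At the optimum each ##e_i## equals ##|a + bx_i - y_i|##, so the objective is exactly the L1 cost of the fit.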
 
Last edited:
  • #22
CopyOfA said:
$$\mathbf{Y} = \begin{bmatrix}y_{11} & y_{12} & \cdots & y_{1m} \\ y_{21} & y_{22} & \cdots & y_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{nm}\end{bmatrix}$$
then
$$\mathbf{R}^{-1}\mathbf{Q}^{T}\mathbf{Y} = \begin{bmatrix} a_1 & a_2 & \cdots & a_m\\ b_1 & b_2 & \cdots & b_m\end{bmatrix}$$
This seems to be clearly undesirable because it raises the question: Which coefficient value should I use? Furthermore, it treats each column of the ##\mathbf{Y}## matrix as a single sequence, and each pair ##\left\{a_i, b_i\right\}## corresponds to the ##i##th column.

again simplify this and use the vec operator. Note how it works.

##vec\big(\mathbf Y\big) = \begin{bmatrix}
\mathbf y_1 \\
\mathbf y_2\\
\vdots \\
\mathbf y_{m-1}\\
\mathbf y_m
\end{bmatrix}##

when you have
##\mathbf Y = \bigg[\begin{array}{c|c|c|c|c}
\mathbf y_1 & \mathbf y_2 &\cdots & \mathbf y_{m-1} & \mathbf y_m\end{array}\bigg]##

so your raw equation is

##
\begin{bmatrix}
\mathbf 1 & \mathbf x\\
\mathbf 1 & \mathbf x \\
\mathbf 1 & \mathbf x \\
\vdots & \vdots\\
\mathbf 1 & \mathbf x
\end{bmatrix}
\begin{bmatrix} a \\ b \end{bmatrix} = vec\big(\mathbf Y\big)##

where

##\mathbf{x} := \begin{bmatrix}
x_1\\
x_2\\
x_3 \\
\vdots \\
x_n
\end{bmatrix}##

now we want to solve for an ##\mathbf a## such that the corresponding linear combination of the columns on the LHS, after subtracting ##vec\big(\mathbf Y\big)##, minimizes some cost function. I.e. our goal is

##
\mathbf v:=

\begin{bmatrix}
\mathbf 1 & \mathbf x\\
\mathbf 1 & \mathbf x \\
\mathbf 1 & \mathbf x \\
\vdots & \vdots\\
\mathbf 1 & \mathbf x
\end{bmatrix}
\mathbf a - vec\big(\mathbf Y\big)##

##\text{minimize cost\_function} \big(\mathbf v\big)##
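(Not part of the original thread.) The stacked system above can be built and solved directly; a numpy sketch on synthetic data shaped like the thread's problem, taking least squares as the cost function:

```python
import numpy as np

rng = np.random.default_rng(4)

n, m = 27, 50                       # n time stamps, m observations each
x = np.arange(1.0, n + 1.0)

# Y: one row per time stamp, one column per replicate (made-up data).
Y = 1.0 + 2.0 * x[:, None] + rng.normal(0.0, 1.0, (n, m))

# vec(Y): stack the columns of Y into one long vector.
y_vec = Y.flatten(order="F")

# The tall design matrix: the [1, x] block repeated once per column of Y.
X_block = np.column_stack([np.ones(n), x])
X_tall = np.tile(X_block, (m, 1))

# Least squares on the stacked system minimizes the L2 cost over ALL data.
coef, *_ = np.linalg.lstsq(X_tall, y_vec, rcond=None)
print(coef)    # [intercept, slope]
```

This yields a single ##(a, b)## pair whose residuals are taken over all ##mn## observations at once, rather than one pair per column of ##\mathbf Y##.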
 
Last edited:
  • #23
This is a clever way of setting up the problem. I'll play around with this, and with any luck, this thread will be closed. Thanks again!
 
  • #24
CopyOfA said:
From what I can tell, the data ##\mathbf{y}_i## at each ##x_i## is not normally distributed. This is one reason that I'm hoping to find some other methods of regression beyond simply fitting through the means.
The central limit theorem says that the mean of samples from a reasonably behaved distribution will approach a normal distribution as the sample size gets large. This is true even if the distribution of the individual Y variable is not normal.
 
  • #25
Whether the y values are normally distributed or not, the normal linear regression algorithm is almost certainly what you want to use. It will minimize the L2 norm. If the y values are not normally distributed, then you can not draw the probabilistic conclusions that you can if they are normally distributed. But I do not see you addressing the probabilities in any of your analysis above anyway. So you might as well use the standard regression tools -- just be careful about any probability statements you make.
 
  • #26
I have trouble seeing this as a regression problem. If the LH predictor variable is a discrete time measurement, like the hour of the day, what use is a function of that arbitrary number for predicting out of sample? The typical time series regression tries to discern a linear relationship between a set of time-ordered values and a set of predictor variables measuring something else over the same period (perhaps with lags), but not the number of periods itself. Your model would need some plausible relationship between the actual discrete time stamp and the RH variable. What causal relationship are you trying to discern between these two?

The box plot does a great job of showing the range of data over time; I'm not sure what value a regression would add to that.

The other thing to look for with time-ordered data is autocorrelation. If the value of y at time ##t_n## is correlated with the value of y at time ##t_{n-1}##, then the t-values will be overstated. There are a number of techniques, such as GMM, to correct for this.
 

1. What is linear regression with a discrete independent variable?

Linear regression with a discrete independent variable is a statistical method used to establish the relationship between a continuous dependent variable and a discrete independent variable. It is often used to predict the value of the dependent variable based on the values of the independent variable.

2. How is linear regression with a discrete independent variable different from linear regression with a continuous independent variable?

The main difference between the two is that a discrete independent variable can only take on a limited number of values, while a continuous independent variable can take on an infinite number of values. This difference affects the way the regression line is calculated and the interpretation of the results.

3. What are the assumptions of linear regression with a discrete independent variable?

The main assumptions of linear regression with a discrete independent variable are linearity, independence, constant variance, and normally distributed errors. These assumptions ensure that the regression model is a good representation of the relationship between the variables and that the results are reliable.

4. How is the significance of a discrete independent variable determined in linear regression?

The significance of a discrete independent variable is determined by its p-value. A p-value less than 0.05 indicates that the variable is statistically significant and has a significant impact on the dependent variable. A p-value greater than 0.05 suggests that the variable may not have a significant impact on the dependent variable.

5. Can a categorical variable be used as a discrete independent variable in linear regression?

Yes, a categorical variable can be used as a discrete independent variable in linear regression. However, it needs to be converted into dummy variables first. Dummy variables are binary variables that represent the categories of the categorical variable and are used in the regression model to represent the different levels of the variable.
