# Understanding multivariate linear regression

I am trying to understand multivariate linear regression.

I have a list of time that it took running processes based on several params, like % of cpu usage, and data read. Eg, I have a process that took 50 seconds to run, with a cpu usage of 70%, and the process read 10bytes of data. I have another process that to 110 seconds to run with a cpu usage of 40% and it read 1Mb of data.

These are ficticious values.

With these data, I calculated a multivariate linear regression, and I got 2 coefficients. C1 for the cpu usage, and C2 for the data read.

If i want to predict how much time it will take to run a next process, how the coefficients can help me?
Eg, I know that the CPU is at 30%, and I know that the process will 4Mb of data. Is it correct to predict the time that the process will run like:

t(pred) = C1 * 30% + C2 * 4Mb

Is this correct? I am mixing percentage with Mb to get seconds. Does this make sense? How these params can help me in the prediction?

If $t(x,y) = c_1 x + c_2 y$ is your model, then it is presumed that the coefficients will be of appropriate units $\frac{\text{Time}}{\text{Percent}}$ and so on.
Whether or not this is a reasonable model is a different question with no easy answer. However, if the changes in each variable is small, then one could see it as a first-order Taylor-series approximation of an underlying system in a small neighborhood of a certain point:
$$t(\vec{x}) = f(\vec x) = t(\vec{x}_0) + \vec{J}_f(\vec{x}_0)(\vec x - \vec{x}_0) + ... \approx t(\vec{x}_0) + \vec{J}_f(\vec{x}_0)(\vec x - \vec{x}_0) \\ t(\vec{x}) - t(\vec{x}_0) = \vec{J}_f(\vec{x}_0)(\vec x - \vec{x}_0) \\ \Delta t(\vec x) = \vec{J}_f(\vec{x}_0) \Delta \vec x$$
where $\vec{J}_f(\vec{x}_0)$ is the Jacobian matrix of $f$ evaluated at $\vec{x}_0$.

In your case with your model, this point would be origin as I'd imagine a process that neither uses the CPU nor reads any data is a process that doesn't do anything, thus takes no time to complete. Thus the error in your model will grow the further away from the origin the input is, even more so when the data used to determine the coefficients isn't near the origin.

In reality, however, even with the linearized model I'd still say it isn't sufficient to predict for anything but the most similar processes. The time it takes to run a process has very little correlation with e.g. process usage. First you must obviously exclude non-halting processes. But even then, you can make a very simple process arbitrarily complex (well, within physical limits anyways). A process that outputs a zero? Think of how many ways you can do that, how mathematically complex you can make it. Even more if you exclude optimization phases at compile-time.

Last edited:
if I have ## t(x,y,z)=c_1 x + c_2y + c_3z ##, where ##c_3## is the total memory of the system, and ##c_1## and ##c_2## are the same. Can I say that ##(t,y,z)## is seconds?

The ##t(x,y,z)## that I calculate is around 150000 (some unit, maybe seconds), and the set that I use to get the coefficients contains the time between 90 and 150 seconds. So, something is wrong here. Any help?

How I choose the order of importance of the variables to get the coefficients?

Last edited:
If $c_3$ is the total amount of memory in the system then dimensional analysis dictates that $z$ has the unit of $\frac{\text{Time}}{\text{Memory Unit}}$. If that's not the case then your model is dimensionally inconsistent.

Order shouldn't matter as long as you're consistent, if I understood you correctly. I assume you're familiar with solving this linear algebra problem, how the equation is set up and so on?

The full units that I am using is:

## seconds = c_1 * \% + c_2 * number + c_3 * bytes + c_4 * MHz + c_5 * MBytes ##

So, the set that I use to get the coefficients contains,
1. total time it took to run a process (seconds)
2. cpu usage at the time (%)
3. number of tasks to perform. That tasks are the number of threads that a process will launch to execute3
4. total size of data that the process will read (bytes)
5. CPU speed (MHz)
6. Memory size (MBytes)
To calculate the coefficients, I have these params saved in a file. I put them in a matrix like this:

## yy = [ array\ with\ the\ seconds\ of\ each\ process\ execution ##
## xx = [ [ param_2, param_3, param_4, param_5, param_6],[param_2,param_3,...],...] ##

With these matrices I calculate the coefficients. Then I predict the time that will take to execute a process with the new coefficients:

Code:
predicted time = c_1 * cpu usage at the time + c_2 * number of tasks to run + c_3 * total size of data that the process will run + c_4 * CPU of the machine where the process will run + c_5 * total memory of the machine

When I calculate the predicted time, I get a number around 150000. My training set that I use to calculate the coefficients is:

Code:
xx = // param4, param2, param3, param5, param6
[[4821328.0,    0.0,    4.0, 3016, 8094852],
[3232269499.0, 0.0,     4.0, 3016, 8094852],
[3232269499.0, 100.0, 4.0, 3016, 8094852],
[4891564.0,    0.0,    4.0, 3016, 8094852],
[3232250945.0, 0.0,     4.0, 3016, 8094852],
[3232269499.0, 0.0,     4.0, 3016, 8094852],
[3232269499.0, 0.0,     4.0, 3016, 8094852],
[4891564.0,    70.83, 4.0, 3016, 8094852],
[3232269499.0, 50.0,   4.0, 3016, 8094852],
[4891564.0,    33.33, 4.0, 3016, 8094852]]

yy = // time that a process spent running (seconds)
['39.190294981',
'128.185252905',
'157.303498983',
'38.8906679153',
'106.472434998',
'159.722404003',
'145.585266829',
'62.647646904',
'157.055114985',
'90.3209991455']

Just for curiosity, this is the python code that I use to calculate the coefficients:

Code:
def calculateLinearRegressionNumpy(xx, yy):
""" calculate multivariate linear regression """
import numpy as np
from numpy import linalg

A = np.column_stack((xx, np.ones(len(xx))))

# linearly generated sequence
coeffs = linalg.lstsq(A, yy)  # obtaining the parameters

# coefficients
return coeffs

With the coefficients I calculate the predicted time:

Code:
def estimatejobexecution(coeffs, params):
""" estimate the time to execute the job """

wqueueacapacity = coeffs
wmaps = coeffs
wcpu_info = coeffs
wmem_info = coeffs

time = (wbytesread*params) + (wqueueacapacity*params) + (wmaps*params) + (wcpu_info*params) + (wmem_info*params) + weights
return time

Am I doing right to get the predicted time?

Last edited:
Alright, first problem I spotted is that your matrix is going to be rank deficient. Shouldn't cause too many problems since lstsq at least in the documentation should take care of it. Most likely some coefficients will end up being zero. Secondly, your matrix is highly ill-conditioned. A standard linear least squares will hence give you larger-than-expected numerical errors. Testing it in Matlab, however, tells me that the errors might not be too big as long as your parameters afterwards don't deviate too far.

But what I think is causing problems is you may have mixed up coefficients with the parameters in this code snippet:
Code:
time = (wbytesread*params) + (wqueueacapacity*params) + (wmaps*params) + (wcpu_info*params) + (wmem_info*params) + weights
I also don't know where the weights came from nor where the sixth coefficient went. (You added a column of ones in A, so there should be one).

Remember that the coefficients come out in the order of the columns in the matrix. So if you have a column of coefficients $\vec c$ and a column of input data $\vec x$ of equal lengths and in the same order then the calculation you want to perform to compute the time is $\vec{c}^\text{T} \vec x$.

I don't understand this sentence:

I also don't know where the weights came from nor where the sixth coefficient went. (You added a column of ones in A, so there should be one).

The weights are the coefficients. I have named wrongly the coefficients. It is:

Code:
time = (wbytesread*params) + (wqueueacapacity*params) + (wmaps*params) + (wcpu_info*params) + (wmem_info*params) + coeffs

I think that the 6th param (coeffs) comes from the column of ones (np.ones(len(xx)), and it is the intercept value (constant). In the equation below it is the ##b_0##. I am not sure if what I say is correct.

http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704-EP713_MultivariableMethods/MLR3.png [Broken] .

The way that I use the coeffs is correct?

Last edited by a moderator:
Okay, I see what you mean. What you need to check then is that the parameters are multiplied by the right coefficients.