Understanding multivariate linear regression

In summary, the conversation discusses the use of multivariate linear regression to predict the time it takes to run a process based on various parameters such as CPU usage, data read, and memory size. The coefficients generated from the regression are used to create a prediction model, but the accuracy of the model may be limited by the range of values used to calculate the coefficients and the complexity of the process being predicted.
  • #1
xeon123
90
0
I am trying to understand multivariate linear regression.

I have a list of time that it took running processes based on several params, like % of cpu usage, and data read. Eg, I have a process that took 50 seconds to run, with a cpu usage of 70%, and the process read 10bytes of data. I have another process that to 110 seconds to run with a cpu usage of 40% and it read 1Mb of data.

These are ficticious values.With these data, I calculated a multivariate linear regression, and I got 2 coefficients. C1 for the cpu usage, and C2 for the data read.

If i want to predict how much time it will take to run a next process, how the coefficients can help me?
Eg, I know that the CPU is at 30%, and I know that the process will 4Mb of data. Is it correct to predict the time that the process will run like:

t(pred) = C1 * 30% + C2 * 4Mb

Is this correct? I am mixing percentage with Mb to get seconds. Does this make sense? How these params can help me in the prediction?
 
Mathematics news on Phys.org
  • #2
If [itex]t(x,y) = c_1 x + c_2 y[/itex] is your model, then it is presumed that the coefficients will be of appropriate units [itex]\frac{\text{Time}}{\text{Percent}}[/itex] and so on.
Whether or not this is a reasonable model is a different question with no easy answer. However, if the changes in each variable is small, then one could see it as a first-order Taylor-series approximation of an underlying system in a small neighborhood of a certain point:
[tex]t(\vec{x}) = f(\vec x) = t(\vec{x}_0) + \vec{J}_f(\vec{x}_0)(\vec x - \vec{x}_0) + ... \approx t(\vec{x}_0) + \vec{J}_f(\vec{x}_0)(\vec x - \vec{x}_0) \\
t(\vec{x}) - t(\vec{x}_0) = \vec{J}_f(\vec{x}_0)(\vec x - \vec{x}_0) \\
\Delta t(\vec x) = \vec{J}_f(\vec{x}_0) \Delta \vec x[/tex]
where [itex]\vec{J}_f(\vec{x}_0)[/itex] is the Jacobian matrix of [itex]f[/itex] evaluated at [itex]\vec{x}_0[/itex].

In your case with your model, this point would be origin as I'd imagine a process that neither uses the CPU nor reads any data is a process that doesn't do anything, thus takes no time to complete. Thus the error in your model will grow the further away from the origin the input is, even more so when the data used to determine the coefficients isn't near the origin.

In reality, however, even with the linearized model I'd still say it isn't sufficient to predict for anything but the most similar processes. The time it takes to run a process has very little correlation with e.g. process usage. First you must obviously exclude non-halting processes. But even then, you can make a very simple process arbitrarily complex (well, within physical limits anyways). A process that outputs a zero? Think of how many ways you can do that, how mathematically complex you can make it. Even more if you exclude optimization phases at compile-time.
 
Last edited:
  • #3
if I have ## t(x,y,z)=c_1 x + c_2y + c_3z ##, where ##c_3## is the total memory of the system, and ##c_1## and ##c_2## are the same. Can I say that ##(t,y,z)## is seconds?

The ##t(x,y,z)## that I calculate is around 150000 (some unit, maybe seconds), and the set that I use to get the coefficients contains the time between 90 and 150 seconds. So, something is wrong here. Any help?

How I choose the order of importance of the variables to get the coefficients?
 
Last edited:
  • #4
If [itex]c_3[/itex] is the total amount of memory in the system then dimensional analysis dictates that [itex]z[/itex] has the unit of [itex]\frac{\text{Time}}{\text{Memory Unit}}[/itex]. If that's not the case then your model is dimensionally inconsistent.

For your second question, you may have made an error somewhere but I'd need more information.

Order shouldn't matter as long as you're consistent, if I understood you correctly. I assume you're familiar with solving this linear algebra problem, how the equation is set up and so on?
 
  • #5
The full units that I am using is:

## seconds = c_1 * \% + c_2 * number + c_3 * bytes + c_4 * MHz + c_5 * MBytes ##

So, the set that I use to get the coefficients contains,
  1. total time it took to run a process (seconds)
  2. cpu usage at the time (%)
  3. number of tasks to perform. That tasks are the number of threads that a process will launch to execute3
  4. total size of data that the process will read (bytes)
  5. CPU speed (MHz)
  6. Memory size (MBytes)
To calculate the coefficients, I have these params saved in a file. I put them in a matrix like this:

## yy = [ array\ with\ the\ seconds\ of\ each\ process\ execution ##
## xx = [ [ param_2, param_3, param_4, param_5, param_6],[param_2,param_3,...],...] ##

With these matrices I calculate the coefficients. Then I predict the time that will take to execute a process with the new coefficients:

Code:
predicted time = c_1 * cpu usage at the time + c_2 * number of tasks to run + c_3 * total size of data that the process will run + c_4 * CPU of the machine where the process will run + c_5 * total memory of the machine

When I calculate the predicted time, I get a number around 150000. My training set that I use to calculate the coefficients is:
Code:
xx = // param4, param2, param3, param5, param6
       [[4821328.0,    0.0,    4.0, 3016, 8094852],
        [3232269499.0, 0.0,     4.0, 3016, 8094852],
        [3232269499.0, 100.0, 4.0, 3016, 8094852],
        [4891564.0,    0.0,    4.0, 3016, 8094852],
        [3232250945.0, 0.0,     4.0, 3016, 8094852],
        [3232269499.0, 0.0,     4.0, 3016, 8094852],
        [3232269499.0, 0.0,     4.0, 3016, 8094852],
        [4891564.0,    70.83, 4.0, 3016, 8094852],
        [3232269499.0, 50.0,   4.0, 3016, 8094852],
        [4891564.0,    33.33, 4.0, 3016, 8094852]]

yy = // time that a process spent running (seconds)
        ['39.190294981',
         '128.185252905',
         '157.303498983',
         '38.8906679153',
         '106.472434998',
         '159.722404003',
         '145.585266829',
         '62.647646904',
         '157.055114985',
         '90.3209991455']

Just for curiosity, this is the python code that I use to calculate the coefficients:

Code:
def calculateLinearRegressionNumpy(xx, yy):
    """ calculate multivariate linear regression """
    import numpy as np
    from numpy import linalg

    A = np.column_stack((xx, np.ones(len(xx))))

    # linearly generated sequence
    coeffs = linalg.lstsq(A, yy)[0]  # obtaining the parameters

    # coefficients
    return coeffs

With the coefficients I calculate the predicted time:

Code:
def estimatejobexecution(coeffs, params):
    """ estimate the time to execute the job """

    # bytesread:currentqueuecapacity:tasks:cpu:mem
    wbytesread = coeffs[0]
    wqueueacapacity = coeffs[1]
    wmaps = coeffs[2]
    wcpu_info = coeffs[3]
    wmem_info = coeffs[4]

    time = (wbytesread*params[0]) + (wqueueacapacity*params[1]) + (wmaps*params[2]) + (wcpu_info*params[3]) + (wmem_info*params[4]) + weights[5]
    return time

Am I doing right to get the predicted time?
 
Last edited:
  • #6
Alright, first problem I spotted is that your matrix is going to be rank deficient. Shouldn't cause too many problems since lstsq at least in the documentation should take care of it. Most likely some coefficients will end up being zero. Secondly, your matrix is highly ill-conditioned. A standard linear least squares will hence give you larger-than-expected numerical errors. Testing it in Matlab, however, tells me that the errors might not be too big as long as your parameters afterwards don't deviate too far.

But what I think is causing problems is you may have mixed up coefficients with the parameters in this code snippet:
Code:
time = (wbytesread*params[0]) + (wqueueacapacity*params[1]) + (wmaps*params[2]) + (wcpu_info*params[3]) + (wmem_info*params[4]) + weights[5]
I also don't know where the weights came from nor where the sixth coefficient went. (You added a column of ones in A, so there should be one).

Remember that the coefficients come out in the order of the columns in the matrix. So if you have a column of coefficients [itex]\vec c[/itex] and a column of input data [itex]\vec x[/itex] of equal lengths and in the same order then the calculation you want to perform to compute the time is [itex]\vec{c}^\text{T} \vec x[/itex].
 
  • #7
I don't understand this sentence:

da_nang said:
I also don't know where the weights came from nor where the sixth coefficient went. (You added a column of ones in A, so there should be one).

The weights are the coefficients. I have named wrongly the coefficients. It is:

Code:
time = (wbytesread*params[0]) + (wqueueacapacity*params[1]) + (wmaps*params[2]) + (wcpu_info*params[3]) + (wmem_info*params[4]) + coeffs[5]

I think that the 6th param (coeffs[5]) comes from the column of ones (np.ones(len(xx)), and it is the intercept value (constant). In the equation below it is the ##b_0##. I am not sure if what I say is correct.

http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704-EP713_MultivariableMethods/MLR3.png .

The way that I use the coeffs[5] is correct?
 
Last edited by a moderator:
  • #8
Okay, I see what you mean. What you need to check then is that the parameters are multiplied by the right coefficients.
 

1. What is multivariate linear regression?

Multivariate linear regression is a statistical analysis method used to predict the relationship between multiple independent variables and a dependent variable. It is a type of linear regression that takes into account multiple variables, rather than just one, to make predictions.

2. What is the purpose of using multivariate linear regression?

The purpose of using multivariate linear regression is to understand and quantify the relationships between multiple variables and a dependent variable. It can also be used to make predictions about the dependent variable based on the values of the independent variables.

3. How is multivariate linear regression different from simple linear regression?

Simple linear regression only takes into account one independent variable, while multivariate linear regression takes into account multiple independent variables. This allows for a more accurate and comprehensive understanding of the relationships between variables.

4. What are some common applications of multivariate linear regression?

Multivariate linear regression is commonly used in various fields, such as economics, finance, social sciences, and marketing, to analyze and predict the relationships between multiple variables. It is also used in machine learning and data analysis to make predictions and identify patterns in large datasets.

5. What are the key assumptions of multivariate linear regression?

The key assumptions of multivariate linear regression include linearity, normality, homoscedasticity, and independence of errors. Linearity assumes that the relationship between the independent variables and the dependent variable is linear. Normality assumes that the errors are normally distributed. Homoscedasticity assumes that the variance of the errors is constant across all values of the independent variables. Independence of errors assumes that the errors are not correlated with each other.

Similar threads

  • Set Theory, Logic, Probability, Statistics
Replies
8
Views
1K
  • Set Theory, Logic, Probability, Statistics
Replies
13
Views
2K
  • Set Theory, Logic, Probability, Statistics
Replies
30
Views
2K
  • Linear and Abstract Algebra
Replies
3
Views
1K
  • Set Theory, Logic, Probability, Statistics
Replies
1
Views
1K
  • Set Theory, Logic, Probability, Statistics
Replies
2
Views
3K
  • Calculus and Beyond Homework Help
Replies
3
Views
2K
  • Linear and Abstract Algebra
Replies
4
Views
2K
  • Set Theory, Logic, Probability, Statistics
Replies
5
Views
1K
  • Set Theory, Logic, Probability, Statistics
Replies
2
Views
1K
Back
Top