Multivariate regression for project cost estimates

In summary: I am not sure how to input the data into the model to get the coefficients.I will look into this further and get back to you.In summary, Richard is considering multivariate regression analysis to determine what factors contribute to the cost of a site investigation. He is uncertain about how to include qualitative data in the analysis. He plans to use Excel for the analysis.
  • #1
Richard_R
14
0
Hi All,

Hope this is the right sub-forum for this.

I currently work for an engineering consultancy and am doing some project management what for what are essentially site investigation reports (sites are quite small, typically a few acres/hectares). Our clients are mostly developers who have purchased land and want to build on it (can be residential, commercial or industrial development, it depends on the client).

I need to put together cost quotes for the site investigation work. I've not been asked to do this before and so was looking at how it's been done in the past and it seems to be mostly based on people's past experience/judgement calls etc. However, I want to put us on a more scientific footing rather than trying to "guess" what the answer will be. I was therefore considering multivariate regression analysis to try and determine what the contributing variables are to the actual cost and to come up with a mathematical model to try and predict future costs based on a limited range of site characteristics.

We have cost data on previous projects, principally the estimated cost, the hours staff spent on it, and then the actual cost (based on the actual staff hours). I also have information relating to each site that was investigated. Some of this I assume would be classed as "quantitative" data such as size of site (in hectares). However, a lot of it is what I would call "qualitative" such as as initial assessment of the level of pollution likely to be on the site, e.g. this is a simple done on a simple scale of 1-3, 1 = low, 2 = moderate, 3 = high.

One of the purposes of the site investigation is to quantify the level of pollution better but at the start we just have a broadscale (essentially desktop study) estimate based on the 1, 2 or 3 scale.

My question is therefore: can I use these sorts of scales in a multivariate regression analysis? In the above example the range 1, 2 or 3 (which is a qualitative judgement), I am not sure that this is treated the same as, for example, the site area (in hectares), which is a quantitative measure where I can simply compare one data point from the site equally with all the others.

Basically I need advice on how to take qualitative values and include them in multivariate regression analysis, as I am not even sure you are supposed to do this, let alone how to do it.

Many thanks in advance for any help provided. :)

Regards
Richard

Ps - I intend to use Excel for the analysis. I have tutorials which show how to use the LINEST function for this sort of thing. I believe I can do this part of the analysis, it's just how to classify/input the data into the model that I am unsure about.
 
Physics news on Phys.org
  • #2
Are you using the qualitative variables as predictors (x-values)? If so, you can represent them with indicator variables.

Consider the one variable you mentioned that is rated at 1, 2, or 3. I need to reference it for illustration, so I'll call it Anon.

You can represent this information with two indicator variables, as

[tex]
X_1 =
\begin{cases}
1, \quad \text{ if } \text{Anon } = 1\\
0, \quad \text{ if } \text{Anon } \ne 0
\end{cases}
[/tex]

Fill in the columns for these indicator variables according to these definitions.
For the rows where Anon = 1, you will have X1=1, X2=0. Wherever Anon = 2, you have X1=0, X2=1. For the rows with Anon = 3, you have X1=X2=0: the values of these two variables indicate which case of Anon you have.

For variables that have 5 ratings, you need 4 indicator variables.

You can find many, many locations on the web that discuss this in detail. Without more information, I'm not sure whether this will or will not fit your needs.

You are brave using Excel for this - it is probably the worst readily available software around for any multiple regression - there are many issues with its performance. Be very careful interpreting its output.

[tex]
X_2 =
\begin{cases}
1, \quad \text{ if } \text{Anon } = 2\\
0, \quad \text{ if } \text{Anon } \ne 0
[/tex]
 
  • #3
Hi statdad,

Thanks for the reply. Yes I am using the qualitative values as predictors (x-values). From what I've been reading about multiple (linear) regression analysis I will end up with an equation of the form:

y = b1x1 + b2x2 + ...bnxn + c

where b are the coefficients for each x predictor and c is a constant.

The data I am using for the regresson analysis just be columns of data, for example column A would be labelled "actual project cost" and column B "pollution rating", column C "site area in hectares" and so forth. I'd want to be able to estimate the actual project cost for future projects based on previous data.

You said there are lots of sites on the web discussing this - do you have any links that you can post?

Also, if Excel is not that good at this sort of thing, can you recommend any (preferably free) software that is?

Thanks again
Richard
 
  • #4
Here is one link that gives the same points i did, with a bit more discussion (the paper is free to download and read, PDF format).
http://www.keithbower.com/Basic Stats/On The Use of Indicator Variables.htm

Indicator variables are also referred to as 'dummy variables'.

This article is interesting - a little more technical in spots.
http://www.masil.org/documents/dummy.pdf

Re Excel: I made my comment based on past experience working with, teaching with, and consulting with people who use Excel. Many of its statistical procedures have been poorly implemented, and some of the regression features are among them. I'm not sure whether your problem would be influenced by any of them, but I usually tend to the cautious side. If you're using Windows you can download a 30-day trial of Minitab from the website. It's full-featured and reliable, menu driven and with a better than average help system. Not available for any other OS.
There is R, which is extremely powerful, available for Windows, OS-X, linux, etc., but is mostly command line.
SAS, SPSS, and a few others, may offer some free trial versions of their software as well: we don't use them at my school so I haven't checked.
 
Last edited by a moderator:
  • #5
Thanks Statdad, the links look very useful. :)

On a final note - I was originally going to use the numbers 1, 2 and 3 to represent the site pollution levels in the regression analysis. I take it from what I've read so far that this would have been the wrong approach to take, and I would have ended up with an incorrect answer. The correct approach (if I'm understanding this right) is to use dummy variables and each of my pollution level "options" has to have a binary value to indicate whether it's true or false? (Where obviously only one can be true at a time for any given site).

Does this sound like I've understood the basic approach?

Thanks
-Richard
 
Last edited:
  • #6
"The correct approach (if I'm understanding this right) is to use dummy variables and each of my pollution level "options" has to have a binary value to indicate whether it's true or false? (Where obviously only one can be true at a time for any given site)."

Sounds like you have the idea. Glad the links helped. good luck with your project.
 
  • #7
You want to document previous knowledge to help future decision making. This is called knowledge engineering. And tons of research have been done for this kind application. I can think about Bayesian Net, and KNN approaches. Regression is also one method. However the margin of error really depends on how the regression model fits on your previous knowledge.

You need to minimize the error and before you can do that you should find a way to quantify the model error.

Tell you what, if you can allow me to publish a research paper on this project I can do this for you as my final report for this semester. I also could have my professor to read the report and assure the logical validity .
 

1. What is multivariate regression for project cost estimates?

Multivariate regression for project cost estimates is a statistical method used to analyze the relationship between multiple independent variables and a dependent variable, in this case, project cost. It allows for the identification of the key factors that influence project cost and the prediction of future project costs based on those factors.

2. How is multivariate regression used in project cost estimation?

Multivariate regression is used in project cost estimation by analyzing historical data on project costs and identifying the variables that have the most significant impact on cost. These variables are then used to create a regression equation that can be used to predict future project costs based on the values of those variables.

3. What are the benefits of using multivariate regression for project cost estimates?

There are several benefits of using multivariate regression for project cost estimates. It allows for a more accurate prediction of project costs by taking into account multiple variables that may affect cost. It also provides a more objective and data-driven approach to cost estimation, reducing the potential for human error or bias.

4. What are the limitations of using multivariate regression for project cost estimates?

While multivariate regression can be a useful tool for project cost estimation, it does have limitations. It requires a significant amount of historical data on project costs and the variables that may influence them. It also assumes a linear relationship between the variables and project cost, which may not always be the case.

5. How can multivariate regression be used to improve project cost management?

Multivariate regression can be used to improve project cost management by providing a more accurate and data-driven approach to cost estimation. By identifying the key factors that influence project cost, project managers can focus on controlling those factors to keep costs within budget. It can also help in identifying potential risks and developing contingency plans to mitigate their impact on project cost.

Similar threads

  • Set Theory, Logic, Probability, Statistics
Replies
23
Views
2K
  • Set Theory, Logic, Probability, Statistics
Replies
13
Views
2K
  • Set Theory, Logic, Probability, Statistics
Replies
1
Views
6K
  • Set Theory, Logic, Probability, Statistics
Replies
2
Views
936
  • Set Theory, Logic, Probability, Statistics
Replies
1
Views
1K
  • Set Theory, Logic, Probability, Statistics
Replies
7
Views
336
  • Set Theory, Logic, Probability, Statistics
Replies
1
Views
1K
  • Set Theory, Logic, Probability, Statistics
Replies
2
Views
864
  • STEM Educators and Teaching
Replies
11
Views
2K
  • Set Theory, Logic, Probability, Statistics
Replies
4
Views
2K
Back
Top