# Multivariate regression for project cost estimates

Hi All,

Hope this is the right sub-forum for this.

I currently work for an engineering consultancy and am doing some project management what for what are essentially site investigation reports (sites are quite small, typically a few acres/hectares). Our clients are mostly developers who have purchased land and want to build on it (can be residential, commercial or industrial development, it depends on the client).

I need to put together cost quotes for the site investigation work. I've not been asked to do this before and so was looking at how it's been done in the past and it seems to be mostly based on people's past experience/judgement calls etc. However, I want to put us on a more scientific footing rather than trying to "guess" what the answer will be. I was therefore considering multivariate regression analysis to try and determine what the contributing variables are to the actual cost and to come up with a mathematical model to try and predict future costs based on a limited range of site characteristics.

We have cost data on previous projects, principally the estimated cost, the hours staff spent on it, and then the actual cost (based on the actual staff hours). I also have information relating to each site that was investigated. Some of this I assume would be classed as "quantitative" data such as size of site (in hectares). However, a lot of it is what I would call "qualitative" such as as initial assessment of the level of pollution likely to be on the site, e.g. this is a simple done on a simple scale of 1-3, 1 = low, 2 = moderate, 3 = high.

One of the purposes of the site investigation is to quantify the level of pollution better but at the start we just have a broadscale (essentially desktop study) estimate based on the 1, 2 or 3 scale.

My question is therefore: can I use these sorts of scales in a multivariate regression analysis? In the above example the range 1, 2 or 3 (which is a qualitative judgement), I am not sure that this is treated the same as, for example, the site area (in hectares), which is a quantitative measure where I can simply compare one data point from the site equally with all the others.

Basically I need advice on how to take qualitative values and include them in multivariate regression analysis, as I am not even sure you are supposed to do this, let alone how to do it.

Many thanks in advance for any help provided. :)

Regards
Richard

Ps - I intend to use Excel for the analysis. I have tutorials which show how to use the LINEST function for this sort of thing. I believe I can do this part of the analysis, it's just how to classify/input the data into the model that I am unsure about.

Homework Helper
Are you using the qualitative variables as predictors (x-values)? If so, you can represent them with indicator variables.

Consider the one variable you mentioned that is rated at 1, 2, or 3. I need to reference it for illustration, so I'll call it Anon.

You can represent this information with two indicator variables, as

$$X_1 = \begin{cases} 1, \quad \text{ if } \text{Anon } = 1\\ 0, \quad \text{ if } \text{Anon } \ne 0 \end{cases}$$

Fill in the columns for these indicator variables according to these definitions.
For the rows where Anon = 1, you will have X1=1, X2=0. Wherever Anon = 2, you have X1=0, X2=1. For the rows with Anon = 3, you have X1=X2=0: the values of these two variables indicate which case of Anon you have.

For variables that have 5 ratings, you need 4 indicator variables.

You can find many, many locations on the web that discuss this in detail. Without more information, I'm not sure whether this will or will not fit your needs.

You are brave using Excel for this - it is probably the worst readily available software around for any multiple regression - there are many issues with its performance. Be very careful interpreting its output.

$$X_2 = \begin{cases} 1, \quad \text{ if } \text{Anon } = 2\\ 0, \quad \text{ if } \text{Anon } \ne 0$$

Thanks for the reply. Yes I am using the qualitative values as predictors (x-values). From what I've been reading about multiple (linear) regression analysis I will end up with an equation of the form:

y = b1x1 + b2x2 + ...bnxn + c

where b are the coefficients for each x predictor and c is a constant.

The data I am using for the regresson analysis just be columns of data, for example column A would be labelled "actual project cost" and column B "pollution rating", column C "site area in hectares" and so forth. I'd want to be able to estimate the actual project cost for future projects based on previous data.

You said there are lots of sites on the web discussing this - do you have any links that you can post?

Also, if Excel is not that good at this sort of thing, can you recommend any (preferably free) software that is?

Thanks again
Richard

Homework Helper
Here is one link that gives the same points i did, with a bit more discussion (the paper is free to download and read, PDF format).
http://www.keithbower.com/Basic Stats/On The Use of Indicator Variables.htm

Indicator variables are also referred to as 'dummy variables'.

http://www.masil.org/documents/dummy.pdf

Re Excel: I made my comment based on past experience working with, teaching with, and consulting with people who use Excel. Many of its statistical procedures have been poorly implemented, and some of the regression features are among them. I'm not sure whether your problem would be influenced by any of them, but I usually tend to the cautious side. If you're using Windows you can download a 30-day trial of Minitab from the website. It's full-featured and reliable, menu driven and with a better than average help system. Not available for any other OS.
There is R, which is extremly powerful, available for Windows, OS-X, linux, etc., but is mostly command line.
SAS, SPSS, and a few others, may offer some free trial versions of their software as well: we don't use them at my school so I haven't checked.

Last edited by a moderator:

On a final note - I was originally going to use the numbers 1, 2 and 3 to represent the site pollution levels in the regression analysis. I take it from what I've read so far that this would have been the wrong approach to take, and I would have ended up with an incorrect answer. The correct approach (if I'm understanding this right) is to use dummy variables and each of my pollution level "options" has to have a binary value to indicate whether it's true or false? (Where obviously only one can be true at a time for any given site).

Does this sound like I've understood the basic approach?

Thanks
-Richard

Last edited:
Homework Helper
"The correct approach (if I'm understanding this right) is to use dummy variables and each of my pollution level "options" has to have a binary value to indicate whether it's true or false? (Where obviously only one can be true at a time for any given site)."