# Discrepancy events

1. Oct 22, 2006

### BiologyGirl

I am having problems deciding on which statistical measure to use. Although this problem is of the simplest type, none of my books seem to address exactly what I need.

Let me describe a typical example:

Suppose that I have a factory that produces four types of products, say boats, cars, planes, and trains. (Big factory, I know). For ideal production, I want 20% of the factory's output to be boats, 25% to be cars, 40% to be planes, and 15% to be trains.

Suppose the factory instead produces 15% boats, 20% cars, 40% planes, and 25% trains. How would I express, using one statistical measure, how far off the factory is producing from the ideal?

I want the answer to be a percentage so that 100% is perfect alignment with the targeted goals, 0% would be no alignment (the factory produces spaceships instead).

My first inclination was simply to find the average discrepancy, that is, take the absolute value of each difference and average them. If needed, I could weight the result to produce a value between 0 and 100%, but something tells me that my plan is too unsophisticated. Is there a form of linear regression that I could use on data that is not described by a function but represented in terms of finite data? What about weighting the standard deviation?

As you can tell, I am not a statistician (all of my experience is using statistics on functional data), but if I was just told the name of the statistical measure to use I could figure out the rest on my own.

2. Nov 3, 2006

### EnumaElish

You are looking for a metric (distance function) so the sum of absolute differences is fine. An alternative measure is the sum of squared differences (or errors, i.e. SSE). You may have to take into account that the percentages always add up to 100%, so you only need to know 3 out of 4. In a regression the errors are based on the difference between an actual value and a projected (estimated) value -- as far as I understand, you are not trying to project anything; whether or how regression might help is not obvious.
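As a minimal sketch of the two measures named above, here they are computed for the factory example from the first post. The variable names are illustrative assumptions, and the shares are kept in whole percentage points so the arithmetic stays exact.

```python
# Target and observed output shares, in percentage points, taken from the
# example in the first post. All names here are illustrative assumptions.
ideal = {"boats": 20, "cars": 25, "planes": 40, "trains": 15}
actual = {"boats": 15, "cars": 20, "planes": 40, "trains": 25}

# Signed difference per product (observed minus target).
diffs = {product: actual[product] - ideal[product] for product in ideal}

# Sum of absolute differences and sum of squared differences (SSE).
sum_abs = sum(abs(d) for d in diffs.values())
sse = sum(d * d for d in diffs.values())

print(diffs)                # {'boats': -5, 'cars': -5, 'planes': 0, 'trains': 10}
print(sum_abs)              # 20
print(sse)                  # 150
print(sum(diffs.values()))  # 0
```

The signed differences sum to zero because both sets of percentages add up to 100%, which is why any three of the four differences determine the fourth.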

3. Nov 15, 2006

### BiologyGirl

Thanks for the response. What advantage does the sum of squared differences have over the sum of absolute differences?

4. Nov 17, 2006

### CRGreathouse

1. It's often easier to analyze mathematically
2. The worst values are weighted more heavily, so "0% boats, 20% cars" is worse than "10% boats, 15% cars" in your example.
3. There aren't continuous ranges of values that are considered 'equally bad', which would make it hard to decide what to prefer.

There are some advantages of absolute differences vs sum of squared differences:
1. It's easier to calculate by hand
2. All values are equally weighted -- the opposite of #2 above
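To illustrate the weighting difference, here is a small sketch comparing two hypothetical days of production (invented for this illustration, not taken from the thread). Both have the same sum of absolute differences from the ideal, but one concentrates its error in two products.

```python
# Ideal shares in percentage points: boats, cars, planes, trains.
ideal = [20, 25, 40, 15]

# Two hypothetical productions, made up for illustration only.
concentrated = [0, 25, 60, 15]  # errors: -20, 0, +20, 0
spread_out = [10, 15, 50, 25]   # errors: -10, -10, +10, +10


def sum_abs_diff(target, observed):
    """Sum of absolute differences between two equal-length sequences."""
    return sum(abs(o - t) for t, o in zip(target, observed))


def sum_sq_diff(target, observed):
    """Sum of squared differences (SSE) between two equal-length sequences."""
    return sum((o - t) ** 2 for t, o in zip(target, observed))


print(sum_abs_diff(ideal, concentrated), sum_abs_diff(ideal, spread_out))  # 40 40
print(sum_sq_diff(ideal, concentrated), sum_sq_diff(ideal, spread_out))    # 800 400
```

Under absolute differences the two days tie, while the SSE rates the day with the larger individual errors as twice as bad.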

5. Nov 21, 2006

### EnumaElish

Visually, think of each day's observed production as a point P in a 3-dimensional space, located according to the 3 coordinates "boats," "cars," "planes." Put the origin (of the coordinates) at "the ideal production" (20%, 25%, 40%). In this setup, the three "production errors" with respect to the ideal (= the origin) along the 3 coordinates exactly describe the location of P. Suppose Monday's production was 10%, 10%, 60% (implicitly, trains = 20%). Then the 3 coordinates are -10%, -15%, 20%. The (square root of) SSE is the equivalent of the Euclidean metric. The sum of absolute differences is the equivalent of the Manhattan metric.
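The Monday example can be worked through in a few lines. This is only a sketch of the geometric picture described above; the names are assumptions, and the values are in percentage points.

```python
import math

# Ideal production (the origin of the shifted coordinates) and Monday's
# observed production, in percentage points: boats, cars, planes.
ideal = [20, 25, 40]
monday = [10, 10, 60]

# Coordinates of Monday's point P relative to the ideal.
coords = [m - i for i, m in zip(ideal, monday)]

# Manhattan metric: sum of absolute coordinates.
manhattan = sum(abs(c) for c in coords)

# Euclidean metric: square root of the sum of squared coordinates (SSE).
sse = sum(c * c for c in coords)
euclidean = math.sqrt(sse)

print(coords)     # [-10, -15, 20]
print(manhattan)  # 45
print(sse)        # 725
print(round(euclidean, 2))
```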

Last edited: Nov 21, 2006

6. Nov 22, 2006

### BiologyGirl

Thanks for the responses. I think the SSE is the best measure and I will go with it.

7. Nov 22, 2006

### Office_Shredder

If you take the Euclidean style, you won't get perfect production = 1, awful production = 0, though. So what you need to do is take your metric value and divide it by the value for the worst possible production (the square root of .15^2 + .2^2, etc.). Then take 1 - that value, so that 1 is the best possible production and 0 is the worst.

That's how I would do it anyway. I'm sure somebody knows a better way.
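A minimal sketch of that normalization follows. It assumes the "worst possible production" is the case from the first post where the factory makes none of the four products (the spaceship case), so the worst distance is the Euclidean length of the ideal vector itself. The function name `alignment_score` is hypothetical.

```python
import math


def alignment_score(ideal, actual):
    """Return 1 - (Euclidean distance to ideal) / (worst-case distance).

    Assumption: the worst case is producing none of the target products,
    so the worst-case distance is the Euclidean length of `ideal`.
    """
    distance = math.sqrt(sum((a - i) ** 2 for i, a in zip(ideal, actual)))
    worst = math.sqrt(sum(i ** 2 for i in ideal))
    return 1 - distance / worst


# Shares in percentage points: boats, cars, planes, trains.
ideal = [20, 25, 40, 15]
actual = [15, 20, 40, 25]

print(alignment_score(ideal, ideal))         # 1.0 (perfect alignment)
print(alignment_score(ideal, [0, 0, 0, 0]))  # 0.0 (none of the products)
print(round(alignment_score(ideal, actual), 3))
```

One caveat under this assumption: a production that still sums to 100% but is concentrated in a single product (for example, all trains) lies farther from the ideal than the zero vector does, so its score would come out negative. Clamping the result at 0 or choosing a different worst case would be a design decision beyond what the thread specifies.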