# Regression analysis sample size problem

Hi there,

Could anybody offer any advice on a linear regression sample size problem?

I am using regression to predict the energy consumption (watt/mile) of an electric car based on a number of parameters such as average velocity, max velocity, average acceleration, the number of stops per mile etc. In total there are 15 parameters.

I record the energy consumption of a car for a journey and compute these parameters.

So the equation would look like:

Energy Consumption = 0.23 × Average_Velocity + 0.5 × Average_acceleration + ... etc

My question is, how would I determine how many journeys I would need to get a sufficient sample size for the regression? Assume the same city, same driver, same weather, same driving style, etc.

I'd appreciate any advice you could offer.

Thank you

John

## Answers and Replies

chiro
Science Advisor
Hey bradyj7.

There are a few things that you need to know about a regression.

The first thing is that a regression tries to fit the existing data. If the sample is not representative of the population, then the regression won't be useful, just as estimating a distribution mean from a sample that is skewed massively to the left or right won't give you the true underlying mean of the population.

So in saying this, you will have to figure out if the sample you have has some decent amount of correspondence with the overall nature of the population data. This is one of the biggest problems statisticians face in that they need to collect a sample, but the sample needs to be a good representation.

To do this effectively, it's not just about mathematics: it's about the actual sampling strategy, and this will depend on the context of your data. You will need to decide what data is important and representative and what isn't, and there is no one answer for this.

Now in terms of the mathematics, the algorithms used will spit out not only the estimates for the coefficients, but also a standard error for each one, which is a kind of standard deviation of the estimate itself. If the standard error is high, it means that the uncertainty for that particular coefficient is high. When it's small, it means that the estimate given the data is precise. If you get a standard error of zero, it means that the estimate for that variable is exact, and if they are all zero then you have a completely deterministic model.
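
To make the coefficient and standard-error output concrete, here is a minimal sketch using NumPy. The journey data is synthetic (made-up numbers, not real measurements), the helper name `ols_with_standard_errors` is hypothetical, and only two of the fifteen parameters are used to keep it short.

```python
import numpy as np

def ols_with_standard_errors(X, y):
    """Fit y = b0 + X @ b by least squares; return (coefficients, standard errors)."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])       # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    dof = n - A.shape[1]                       # observations minus fitted coefficients
    sigma2 = resid @ resid / dof               # estimate of the residual variance
    cov = sigma2 * np.linalg.inv(A.T @ A)      # covariance matrix of the coefficients
    return coef, np.sqrt(np.diag(cov))

# Synthetic journeys: the "true" coefficients below are invented for illustration.
rng = np.random.default_rng(0)
n_journeys = 60
avg_velocity = rng.uniform(20, 60, n_journeys)
stops_per_mile = rng.uniform(0, 5, n_journeys)
energy = 150 + 2.0 * avg_velocity + 12.0 * stops_per_mile + rng.normal(0, 10, n_journeys)

X = np.column_stack([avg_velocity, stops_per_mile])
coef, se = ols_with_standard_errors(X, energy)
for name, c, s in zip(["intercept", "avg_velocity", "stops_per_mile"], coef, se):
    print(f"{name:>15}: {c:8.3f} ± {s:.3f}")
```

A coefficient whose standard error is large relative to its value is one the data has not pinned down yet. Note that the degrees of freedom (`n` minus the number of coefficients) must be positive, so with 15 parameters plus an intercept you need more than 16 journeys before standard errors can be computed at all.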

Now some advice of my own (which you will have to decide whether it's useful or not, and I encourage you to do your own research):

Start off by trying to find out what data is important and what is representative. You are collecting journey data so think about factors like time of day, location, demographics (earnings and income, age distribution), price of energy for electric, gas, etc, as well as the terrain used for driving (flat, lots of hills, urban, lots of traffic lights, city, country, etc) and then try and simplify it in a way to get the attributes that are going to have the most impact on the data (think of each thing as a weight and choose the weights that have the biggest impact).

Also if you collect more data, you should also do the following:

After each collection do a regression, get the output and look at the values of the coefficients and standard errors. Save this info in a document. Then do a few predictions both inside the scope of the data and outside of the scope. Save these predictions in the same file.

When you collect new data, compare the new coefficients, standard errors, and predictions made on the new data with the other ones.

This will give you an idea of how representative your data at each collection stage really was and will indirectly tell you things that are important because you can look back at the context of the data and see what is different.

So at each collection stage take a note of anything different qualitatively for what you are measuring so that you can really narrow down what was different if your regression model changed wildly.
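
The collect, refit, and compare routine above can be sketched as follows. This is only an illustration on synthetic data: the function names, the batch size of 25, and the check point `x_check` are all assumptions, not anything from a real study.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns (coefficients, standard errors)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sigma2 = resid @ resid / (len(y) - A.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    return coef, se

def refit_history(X, y, batch_size, x_check):
    """Refit on the first n journeys for growing n, logging coefficients,
    standard errors, and one prediction at a fixed check point."""
    history = []
    for n in range(batch_size, len(y) + 1, batch_size):
        coef, se = fit_ols(X[:n], y[:n])
        prediction = coef[0] + x_check @ coef[1:]
        history.append({"n": n, "coef": coef, "se": se, "prediction": prediction})
    return history

# Synthetic stand-in for journeys arriving over time (invented coefficients).
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 3))
y = 5 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.5, 200)
x_check = np.array([0.5, 0.5, 0.5])

history = refit_history(X, y, batch_size=25, x_check=x_check)
for prev, cur in zip(history, history[1:]):
    shift = np.max(np.abs(cur["coef"] - prev["coef"]))
    print(f"n={cur['n']:3d}  largest coefficient change={shift:.3f}  "
          f"largest SE={cur['se'].max():.3f}  prediction={cur['prediction']:.3f}")
```

If the coefficients and the prediction stop moving by more than their standard errors from one stage to the next, that is some evidence the earlier batches were already representative. A sudden jump is the cue to look back at what was qualitatively different in the new batch.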

haruspex
Science Advisor
Homework Helper
Gold Member
2020 Award
So the equation would look like:
Energy Consumption = 0.23 × Average_Velocity + 0.5 × Average_acceleration + ... etc
You are assuming a linear relationship to each variable. You will be much better off if you can start with a model (equation structure) based on physical principles. Failing that, look out for indications of nonlinearity in the data. OTOH, don't get carried away with higher-order terms based purely on the data. There is a science of justifying mathematical models: it penalises the model according to the number of arbitrary coefficients.
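
One widely used criterion of this kind is the Akaike information criterion (AIC); whether it is the one meant here is an assumption. The sketch below compares polynomial models of different order on synthetic data. The quadratic dependence on velocity is invented purely to give the example some nonlinearity, and the helper name is hypothetical.

```python
import numpy as np

def aic_least_squares(y, y_hat, n_coef):
    """AIC for a least-squares fit with Gaussian errors, up to an additive
    constant that is the same for every model fitted to the same data."""
    n = len(y)
    rss = float(np.sum((y - y_hat) ** 2))
    return n * np.log(rss / n) + 2 * n_coef    # fit term + penalty per coefficient

# Synthetic data with a quadratic dependence on velocity (illustrative only).
rng = np.random.default_rng(2)
velocity = rng.uniform(20, 60, 80)
energy = 100 + 0.05 * velocity ** 2 + rng.normal(0, 8, 80)

# Standardise the predictor so higher powers stay numerically well behaved.
z = (velocity - velocity.mean()) / velocity.std()

scores = {}
for degree in (1, 2, 4):
    coeffs = np.polyfit(z, energy, degree)
    fitted = np.polyval(coeffs, z)
    scores[degree] = aic_least_squares(energy, fitted, degree + 1)
    print(f"degree {degree}: AIC = {scores[degree]:.1f}")

best_degree = min(scores, key=scores.get)
print("lowest AIC at degree", best_degree)
```

Lower AIC is better. A higher-order model always reduces the residual sum of squares, but it only wins if the improvement outweighs the penalty of 2 per extra coefficient, which is the guard against getting carried away with higher-order terms.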

Thank you for your help and time Chiro and haruspex, I'll take your advice into consideration.

Stephen Tashi
Science Advisor
My question is, how would I determine how many journeys I would need to get a sufficient sample size for the regression?

The word "sufficient" has no precise definition in this context. If you are willing to say precisely what it means, you might get a specific answer.

In facing real world problems, it's very common to hear people express their goals using words that have a connotation of goodness, respectability and scientific merit (such as "sufficient", "optimal", "best", "valid", "reliable"). These words express genuine human longings, but no specific mathematical requirements. It's usually a big mental struggle to define a real world problem as an actual mathematical problem. In statistical problems, people usually take a variety of different approaches, look at the answers and see "how they like them". I don't know whether this should be viewed as exposing normal human failings or as testing the robustness of various assumptions!
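
As one example of what a precise definition could look like (an assumption for illustration, not the only reasonable choice): "sufficient" might mean "the standard error of a given coefficient is below some target". If further journeys resemble the pilot journeys (same spread of predictors, same noise level), a standard error shrinks roughly like 1/sqrt(n), which gives a rough projection. The function name and the numbers are hypothetical.

```python
import math

def journeys_needed(n_pilot, se_pilot, se_target):
    """Project the number of journeys needed to reach se_target, assuming the
    standard error scales like 1/sqrt(n) from a pilot study of n_pilot journeys."""
    if n_pilot <= 0 or se_pilot <= 0 or se_target <= 0:
        raise ValueError("n_pilot, se_pilot and se_target must all be positive")
    projected = math.ceil(n_pilot * (se_pilot / se_target) ** 2)
    return max(n_pilot, projected)   # never fewer than the journeys already recorded

# Hypothetical pilot: 30 journeys gave a standard error of 0.40 on one coefficient,
# and the goal is a standard error of 0.10.
print(journeys_needed(30, 0.40, 0.10))
```

Halving the standard error costs four times as many journeys under this assumption, so the projection is most useful for deciding whether a target is realistic at all. Run it for the coefficient with the worst standard error, since that one sets the requirement.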

Hi,

I was hoping that somebody might be in a position to advise me on a problem that I have.

I am recording the travel patterns of a car for 80 different families for 2 years. Specifically I am recording the following:

1. Journey Start Times
2. Journey End Times
3. Distances
4. No. of journeys per day

Is there a method by which I could investigate/prove that the distributions of these variables are representative of the general population?

I would appreciate any comments or feedback.

Look forward to hearing from you.

Thanks

John

haruspex
Science Advisor
Homework Helper
Gold Member
2020 Award
You could try reading http://en.wikipedia.org/wiki/Sample_size, but I suspect your main difficulty will be a sampling bias, i.e. the way you selected the 80 families makes them unlikely to be representative of the whole population.
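
If reference data for the general population is available (for example journey distances from a travel survey, which is an assumption here), one simple check is to compare the two empirical distributions. The sketch below computes the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical cumulative distribution functions. All the numbers are made up.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    # Both ECDFs are step functions, so the largest gap occurs at a data point.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )

# Made-up journey distances in miles: the study families vs. a reference survey.
study_distances = [2.1, 3.4, 5.0, 7.8, 12.5, 3.3, 4.1, 6.0]
reference_distances = [1.5, 2.8, 4.4, 9.0, 15.2, 3.9, 5.5, 22.0, 7.1, 2.2]
print(f"KS statistic: {ks_statistic(study_distances, reference_distances):.2f}")
```

A statistic near 0 means the two distributions look alike and a value near 1 means they barely overlap. This can only flag a mismatch on the variable you compare; it cannot prove the sample is representative, and journeys from the same family are not independent, so standard significance thresholds for this statistic would not apply directly.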

Bacle2
Science Advisor
How can you tell from reading the post how the families were selected?

Bacle2
Science Advisor
You may want to look into Factor Analysis too, re part of what Chiro said, which may allow you to reduce the number of factors that you are considering:

http://en.wikipedia.org/wiki/Factor_analysis
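
As a quick first look before a full factor analysis, a related but simpler technique is principal component analysis (this is PCA, not factor analysis proper). The sketch shows how much of the total variance in standardised predictors a few directions capture. The data is synthetic, with a hypothetical latent "driving pace" deliberately driving three correlated predictors.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance captured by each principal component of the
    standardised columns of X, largest first."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)            # zero mean, unit variance
    singular_values = np.linalg.svd(Z, compute_uv=False)
    variances = singular_values ** 2                    # proportional to component variance
    return variances / variances.sum()

# Synthetic predictors that all depend on one invented latent variable.
rng = np.random.default_rng(3)
pace = rng.normal(0, 1, 100)
avg_velocity = 40 + 8 * pace + rng.normal(0, 1, 100)
max_velocity = 65 + 10 * pace + rng.normal(0, 2, 100)
stops_per_mile = 2.5 - 0.8 * pace + rng.normal(0, 0.3, 100)

X = np.column_stack([avg_velocity, max_velocity, stops_per_mile])
ratios = explained_variance_ratio(X)
print("share of variance per component:", np.round(ratios, 3))
```

If the first few components account for nearly all of the variance, several of the 15 parameters are carrying overlapping information, and fewer effective predictors means fewer journeys are needed to estimate them.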

haruspex
Science Advisor
Homework Helper
Gold Member
2020 Award
How can you tell from reading the post how the families were selected?
It was an example of bias that can occur. "may make" would have been less confusing.

Bacle2
Science Advisor
Yes, sorry, I realized later that's what you meant; not too common-sensical on my part.