Regression analysis sample size problem

In summary, the conversation discusses a problem with determining the sample size for a linear regression model used to predict energy consumption of an electric car. The participants offer advice on how to ensure that the sample is representative of the population and suggest collecting data on factors such as time of day, location, and terrain. They also discuss the need to define the term "sufficient" and the struggle to turn real world problems into mathematical ones. Finally, the conversation touches on a specific problem of recording travel patterns of cars for 80 families for 2 years and determining if the distributions of variables are representative.
  • #1
bradyj7
Hi there,

Could anybody offer any advice on a linear regression sample size problem?

I am using regression to predict the energy consumption (watt-hours/mile) of an electric car based on a number of parameters such as average velocity, max velocity, average acceleration, the number of stops per mile, etc. In total there are 15 parameters.

I record the energy consumption of a car for a journey and compute these parameters.

So the equation would look like:

Energy Consumption = 0.23 Average_Velocity + 0.5 Average_Acceleration + ... etc

My question is, how would I determine how many journeys I would need to get a sufficient sample size for the regression? Assuming the same city, same driver, same weather, same driving style, etc.

I'd appreciate any advice you could offer

Thank you

John
 
  • #2
Hey bradyj7.

There are a few things that you need to know about a regression.

The first thing is that the regression tries to fit the existing data. If the sample is not representative of the population, then the regression won't be useful, just as a distribution mean estimated from a sample that is skewed massively to the left or right won't represent the true underlying mean of the population.

So in saying this, you will have to figure out if the sample you have has some decent amount of correspondence with the overall nature of the population data. This is one of the biggest problems statisticians face in that they need to collect a sample, but the sample needs to be a good representation.

To do this effectively, it's not just about mathematics: it's about the actual sampling strategy, and this will depend on the context of your data. You will need to decide what data is important and representative and what isn't, and there is no one answer for this.

Now in terms of the mathematics, the algorithms used will spit out not only the estimates for the coefficients but also their standard errors. A standard error is an estimate of the standard deviation that a coefficient estimate would show across repeated samples. If the standard error is high, then the uncertainty for that particular coefficient is high. When it's small, the estimate is precise given the data and the assumed model. If you get a standard error of zero, the estimate for that coefficient is exact, and if they are all zero then you have a completely deterministic model.
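To make the idea of coefficient estimates and standard errors concrete, here is a minimal sketch using NumPy. The journey data are synthetic, and the two predictors, their ranges, and the "true" coefficients are made up purely for illustration. The standard errors use the usual ordinary least squares formula, which assumes independent errors with constant variance.

```python
import numpy as np

def ols_with_standard_errors(X, y):
    """Fit y = b0 + X @ b by least squares; return (coefficients, standard errors)."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])            # prepend an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)    # least squares solution
    residuals = y - A @ beta
    dof = n - (p + 1)                               # observations minus fitted coefficients
    sigma2 = residuals @ residuals / dof            # estimated residual variance
    cov = sigma2 * np.linalg.inv(A.T @ A)           # covariance of the coefficient estimates
    return beta, np.sqrt(np.diag(cov))

# Synthetic journeys (hypothetical values, not real measurements)
rng = np.random.default_rng(0)
n_journeys = 200
avg_velocity = rng.uniform(20, 60, n_journeys)
stops_per_mile = rng.uniform(0, 5, n_journeys)
X = np.column_stack([avg_velocity, stops_per_mile])
energy = 150 + 2.0 * avg_velocity + 30.0 * stops_per_mile + rng.normal(0, 20, n_journeys)

beta, se = ols_with_standard_errors(X, energy)
for name, b, s in zip(["intercept", "avg_velocity", "stops_per_mile"], beta, se):
    print(f"{name:>15}: {b:8.2f} +/- {s:.2f}")
```

If the noise term is removed from `energy`, the standard errors collapse to (numerically) zero, which is the deterministic case described above.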

Now some advice of my own (you will have to decide whether it's useful or not, and I encourage you to do your own research):

Start off by trying to find out what data is important and what is representative. You are collecting journey data, so think about factors like time of day, location, demographics (earnings and income, age distribution), price of energy for electric, gas, etc., as well as the terrain used for driving (flat, lots of hills, urban, lots of traffic lights, city, country, etc.). Then try to simplify this down to the attributes that are going to have the most impact on the data (think of each thing as a weight and choose the weights that have the biggest impact).

Also, if you collect more data, you should do the following:

After each collection do a regression, get the output and look at the values of the coefficients and standard errors. Save this info in a document. Then do a few predictions both inside the scope of the data and outside of the scope. Save these predictions in the same file.

When you collect new data, compare the new coefficients, standard errors, and predictions made on the new data with the earlier ones.

This will give you an idea of how representative your data at each collection stage really was and will indirectly tell you things that are important because you can look back at the context of the data and see what is different.

So at each collection stage take a note of anything different qualitatively for what you are measuring so that you can really narrow down what was different if your regression model changed wildly.
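The collect, refit, and compare routine described above can be sketched as follows. This is only an illustration on synthetic data: the three predictors, the stage sizes, and the two prediction points (`x_inside`, `x_outside`) are assumptions chosen for the example, and a real log would also carry the qualitative notes about each collection stage.

```python
import numpy as np

def fit_ols(X, y):
    """Return least squares coefficients (intercept first) and their standard errors."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (len(y) - A.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    return beta, se

# Synthetic stand-in for journeys collected over time (hypothetical values)
rng = np.random.default_rng(1)
total = 400
X_all = rng.uniform(0, 10, size=(total, 3))
y_all = 5.0 + X_all @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 3, total)

x_inside = np.array([5.0, 5.0, 5.0])      # inside the range of the collected data
x_outside = np.array([20.0, 20.0, 20.0])  # outside it, i.e. an extrapolation

log = []
for n in (25, 50, 100, 200, 400):          # one refit per collection stage
    beta, se = fit_ols(X_all[:n], y_all[:n])
    log.append({
        "n": n,
        "beta": beta,
        "se": se,
        "pred_inside": beta[0] + x_inside @ beta[1:],
        "pred_outside": beta[0] + x_outside @ beta[1:],
    })

for entry in log:
    print(entry["n"], np.round(entry["se"], 3),
          round(entry["pred_inside"], 2), round(entry["pred_outside"], 2))
```

Coefficients and predictions that keep moving between stages, especially the extrapolated ones, are a hint that the earlier data did not cover the conditions seen later.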
 
  • #3
bradyj7 said:
So the equation would look like:
Energy Consumption = 0.23 Average_Velocity + 0.5 Average_Acceleration + ... etc
You are assuming a linear relationship to each variable. You will be much better off if you can start with a model (equation structure) based on physical principles. Failing that, look out for indications of nonlinearity in the data. OTOH, don't get carried away with higher-order terms based purely on the data. There is a science of justifying mathematical models (model selection, using criteria such as AIC or BIC): it penalises the model according to the number of free coefficients.
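One common way to penalise extra coefficients is the Akaike information criterion (AIC). The sketch below is an illustration, not the method used in the thread: it assumes Gaussian errors, uses the least squares form AIC = n ln(RSS/n) + 2k (constants dropped), and fits synthetic data in which energy grows with the square of speed, as an aerodynamic drag term would suggest. Lower AIC is better.

```python
import numpy as np

def aic_polynomial(x, y, degree):
    """AIC (up to an additive constant) of a least squares polynomial fit."""
    coeffs = np.polyfit(x, y, degree)
    rss = float(np.sum((y - np.polyval(coeffs, x)) ** 2))
    n, k = len(y), degree + 1               # k = number of fitted coefficients
    return n * np.log(rss / n) + 2 * k

# Synthetic data with a quadratic dependence on speed (hypothetical values)
rng = np.random.default_rng(2)
v = rng.uniform(10, 70, 150)
energy = 180 + 0.05 * v**2 + rng.normal(0, 10, 150)

aic = {degree: aic_polynomial(v, energy, degree) for degree in (1, 2, 3)}
for degree, value in aic.items():
    print(f"degree {degree}: AIC = {value:.1f}")
```

Here the straight line is clearly worse than the quadratic, while the cubic gains little fit for its extra coefficient, so the penalty usually leaves it no better than the quadratic.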
 
  • #4
Thank you for your help and time Chiro and haruspex, I'll take your advice into consideration.
 
  • #5
bradyj7 said:
My question is, how would I determine how many journeys I would need to get a sufficient sample size for the regression?

The word "sufficient" has no precise definition in this context. If you are willing to say precisely what it means, you might get a specific answer.

In facing real world problems, it's very common to hear people express their goals using words that have a connotation of goodness, respectability and scientific merit (such as "sufficient", "optimal", "best", "valid", "reliable"). These words express genuine human longings, but no specific mathematical requirements. It's usually a big mental struggle to define a real world problem as an actual mathematical problem. In statistical problems, people usually take a variety of different approaches, look at the answers and see "how they like them". I don't know whether this should be viewed as exposing normal human failings or as testing the robustness of various assumptions!
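As one possible way to make "sufficient" precise (an assumption for illustration, not something the original poster specified), require that the standard error of a chosen coefficient be at most some target value. Under the usual ordinary least squares assumptions, with residual standard deviation sigma, sample variance s_j^2 of predictor j, and R_j^2 the R-squared from regressing predictor j on the other predictors, this gives a lower bound on the number of journeys n:

```latex
\operatorname{SE}(\hat\beta_j)
  = \frac{\sigma}{\sqrt{(n-1)\, s_j^{2}\, \bigl(1 - R_j^{2}\bigr)}}
  \le \delta
\quad\Longrightarrow\quad
n \;\ge\; 1 + \frac{\sigma^{2}}{\delta^{2}\, s_j^{2}\, \bigl(1 - R_j^{2}\bigr)}
```

The quantities sigma, s_j and R_j are not known in advance, so in practice they would have to come from a pilot set of journeys. The bound grows quickly when predictors are strongly correlated with each other (R_j^2 near 1), which is plausible for quantities like average and maximum velocity.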
 
  • #6
Hi,

I was hoping that somebody might be in a position to advise me on a problem that I have.

I am recording the travel patterns of a car for 80 different families for 2 years. Specifically I am recording the following:

1. Journey start times
2. Journey end times
3. Distances
4. No. of journeys per day

Is there a method by which I could investigate/prove that the distributions of these variables are representative of the general population?

I would appreciate any comments or feedback.

Look forward to hearing from you.

Thanks

John
 
  • #7
You could try reading http://en.wikipedia.org/wiki/Sample_size, but I suspect your main difficulty will be a sampling bias, i.e. the way you selected the 80 families makes them unlikely to be representative of the whole population.
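One way to look for sampling bias is to compare a recorded variable against an external benchmark, such as a national travel survey, if one is available. The sketch below assumes such a reference exists; both data sets here are synthetic and the function names are made up for the example. It computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distribution functions) and compares it with the standard large-sample critical value.

```python
import numpy as np

def ks_two_sample(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def ks_critical_value(n, m, alpha=0.05):
    """Large-sample critical value for the two-sample KS statistic."""
    return float(np.sqrt(-0.5 * np.log(alpha / 2)) * np.sqrt((n + m) / (n * m)))

# Synthetic journey distances (hypothetical): recorded sample vs. reference survey
rng = np.random.default_rng(3)
sample_km = rng.lognormal(mean=1.5, sigma=0.8, size=2000)
reference_km = rng.lognormal(mean=2.0, sigma=0.8, size=5000)

d_stat = ks_two_sample(sample_km, reference_km)
crit = ks_critical_value(len(sample_km), len(reference_km))
print(f"D = {d_stat:.3f}, 5% critical value = {crit:.3f}")
print("distributions differ" if d_stat > crit else "no difference detected")
```

Two cautions apply. The test assumes independent observations, and journeys from the same family are not independent, so it will tend to overstate the evidence. Also, failing to detect a difference does not prove that the sample is representative.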
 
  • #8
haruspex said:
You could try reading http://en.wikipedia.org/wiki/Sample_size, but I suspect your main difficulty will be a sampling bias, i.e. the way you selected the 80 families makes them unlikely to be representative of the whole population.

How can you tell from reading the post how the families were selected?
 
  • #9
You may want to look into Factor Analysis too, re part of what Chiro said, which may allow you to reduce the number of factors that you are considering:

http://en.wikipedia.org/wiki/Factor_analysis
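As a rough illustration of why this can help, the sketch below is not a full factor analysis. It only performs a common preliminary step: inspecting the eigenvalues of the correlation matrix of the predictors and counting those above 1 (the Kaiser rule of thumb). The data are synthetic, built so that 15 hypothetical journey parameters are driven by 3 underlying factors.

```python
import numpy as np

rng = np.random.default_rng(4)
n_journeys, n_latent, n_params = 500, 3, 15

# Each latent factor drives a block of 5 observed parameters (hypothetical structure)
latent = rng.normal(size=(n_journeys, n_latent))
loadings = np.kron(np.eye(n_latent), np.ones((1, n_params // n_latent)))
observed = latent @ loadings + 0.5 * rng.normal(size=(n_journeys, n_params))

corr = np.corrcoef(observed, rowvar=False)       # 15 x 15 correlation matrix
eigenvalues = np.linalg.eigvalsh(corr)[::-1]     # sorted largest first
n_retained = int(np.sum(eigenvalues > 1.0))      # Kaiser rule: keep eigenvalues > 1

print(np.round(eigenvalues, 2))
print("suggested number of factors:", n_retained)
```

With real journey data the separation between large and small eigenvalues is rarely this clean, so the count should be treated as a starting point rather than an answer.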
 
  • #10
Bacle2 said:
How can you tell from reading the post how the families were selected?
It was an example of bias that can occur. "may make" would have been less confusing.
 
  • #11
Yes, sorry, I realized later that's what you meant; not too common-sensical on my part.
 

1. What is the regression analysis sample size problem?

The regression analysis sample size problem is the question of how many observations a regression needs. A sample that is too small gives imprecise coefficient estimates with large standard errors, and a model with many predictors can overfit it. Whether the sample represents the population is a separate matter that depends on how the data were collected, not only on how many points there are.

2. Why is sample size important in regression analysis?

Sample size is important in regression analysis because it affects the accuracy and precision of the estimated coefficients and their corresponding standard errors. A larger sample size generally leads to more reliable results and a smaller margin of error.

3. How do I determine the appropriate sample size for regression analysis?

The appropriate sample size for regression analysis depends on various factors such as the desired level of precision, the complexity of the model, and the amount of variation in the data. Generally, a larger sample size is recommended for more complex models and when the variables have high variability.

4. What are the consequences of using a sample size that is too small in regression analysis?

Using a sample size that is too small in regression analysis leads to imprecise estimates with wide confidence intervals, and a model with many coefficients may end up fitting noise. This can result in incorrect conclusions and potentially misleading insights. It can also make it difficult to generalize the results to the larger population.

5. Can I still get meaningful results from a regression analysis with a small sample size?

While a larger sample size is generally preferred for more accurate results, meaningful insights can still be obtained from a regression analysis with a small sample size. However, it is important to note the limitations and potential bias in the results and to interpret them with caution.
