Regression analysis sample size problem

  1. Jul 9, 2012 #1
    Hi there,

    Could anybody offer any advice on a linear regression sample size problem?

    I am using regression to predict the energy consumption (watt-hours/mile) of an electric car based on a number of parameters such as average velocity, max velocity, average acceleration, the number of stops per mile, etc. In total there are 15 parameters.

    I record the energy consumption of a car for a journey and compute these parameters.

    So the equation would look like:

    Energy Consumption = 0.23 * Average_Velocity + 0.5 * Average_Acceleration + ... etc.

    My question is: how would I determine how many journeys I need to get a sufficient sample size for the regression? Assume the same city, same driver, same weather, same driving style, etc.

    I'd appreciate any advice you could offer.

    Thank you

    John
     
  3. Jul 9, 2012 #2

    chiro

    Science Advisor

    Hey bradyj7.

    There are a few things that you need to know about a regression.

    The first thing is that the regression tries to fit the existing data, so if the sample is not representative of the population, the regression won't be useful. In the same way, a distribution mean estimated from a sample that is skewed massively to the left or right won't represent the true underlying mean of the population.

    So in saying this, you will have to figure out if the sample you have has some decent amount of correspondence with the overall nature of the population data. This is one of the biggest problems statisticians face in that they need to collect a sample, but the sample needs to be a good representation.

    To do this effectively, it's not just about mathematics: it's about the actual sampling strategy, and this will depend on the context of your data. You will need to decide what data is important and representative and what isn't, and there is no one answer for this.

    Now in terms of the mathematics, the algorithms used will spit out not only the estimates for the coefficients, but also the standard error of each one, which is a kind of standard deviation, but of an estimate computed from a sample. If the standard error is high, then the uncertainty for that particular coefficient is high. When it's small, the estimate given the data is going to be very good. If you get a standard error of zero, the estimate for that coefficient is exact, and if they are all zero then you have a completely deterministic model.
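
    For reference, the standard errors mentioned above come from the usual ordinary least squares formulas, sketched here. The symbols are not from the thread and are assumptions for illustration: n is the number of journeys, p the number of fitted coefficients (16 if an intercept is added to the 15 parameters), X the n-by-p matrix of journey parameters, and y the vector of measured energy consumption.

```latex
\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y,
\qquad
\hat{\sigma}^{2} = \frac{\lVert y - X\hat{\beta} \rVert^{2}}{n - p},
\qquad
\mathrm{SE}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^{2}\,\bigl[(X^{\top} X)^{-1}\bigr]_{jj}}
```

    Two things follow for the sample size question: the formulas need n > p before any standard error exists at all, and if new journeys resemble the earlier ones, each standard error shrinks roughly in proportion to 1/sqrt(n).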

    Now some advice of my own (you will have to decide whether it's useful or not, and I encourage you to do your own research):

    Start off by trying to find out what data is important and what is representative. You are collecting journey data so think about factors like time of day, location, demographics (earnings and income, age distribution), price of energy for electric, gas, etc, as well as the terrain used for driving (flat, lots of hills, urban, lots of traffic lights, city, country, etc) and then try and simplify it in a way to get the attributes that are going to have the most impact on the data (think of each thing as a weight and choose the weights that have the biggest impact).

    Also if you collect more data, you should also do the following:

    After each collection do a regression, get the output and look at the values of the coefficients and standard errors. Save this info in a document. Then do a few predictions both inside the scope of the data and outside of the scope. Save these predictions in the same file.

    When you collect new data, compare the new coefficients, standard errors, and predictions made on the new data with the earlier ones.

    This will give you an idea of how representative your data at each collection stage really was and will indirectly tell you things that are important because you can look back at the context of the data and see what is different.

    So at each collection stage take a note of anything different qualitatively for what you are measuring so that you can really narrow down what was different if your regression model changed wildly.
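
    The refit-and-compare routine described above could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses a single predictor (average velocity) rather than all 15, the journey data is synthetic, and the names fit_line and track_fits are hypothetical helpers, not part of any library.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, se_b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sigma2 = rss / (n - 2)          # residual variance, 2 fitted parameters
    se_b = math.sqrt(sigma2 / sxx)  # standard error of the slope
    return a, b, se_b


def track_fits(xs, ys, batch_size):
    """Refit after each new batch of journeys and log (n, slope, se_slope)."""
    log = []
    for n in range(batch_size, len(xs) + 1, batch_size):
        _, b, se_b = fit_line(xs[:n], ys[:n])
        log.append((n, b, se_b))
    return log


# Synthetic journeys (made-up numbers, for illustration only):
# x = average velocity, y = energy use with a true slope of 0.23 plus noise.
rng = random.Random(0)
velocity = [rng.uniform(10, 60) for _ in range(200)]
energy = [50 + 0.23 * v + rng.gauss(0, 5) for v in velocity]

history = track_fits(velocity, energy, batch_size=20)
for n, slope, se in history:
    print(f"n={n:3d}  slope={slope:.3f}  se={se:.3f}")
```

    If the slope keeps jumping between collection stages, or the standard error stops shrinking, that is a hint that the newer journeys differ in kind from the earlier ones.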
     
  4. Jul 10, 2012 #3

    haruspex

    Science Advisor
    Homework Helper
    Gold Member
    2016 Award

    You are assuming a linear relationship to each variable. You will be much better off if you can start with a model (equation structure) based on physical principles. Failing that, look out for indications of nonlinearity in the data. OTOH, don't get carried away with higher-order terms based purely on the data. There is a science of justifying mathematical models: it penalises the model according to the number of arbitrary coefficients.
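
    One widely used example of such a penalty is the Akaike information criterion (AIC); whether this is the specific method meant above is an assumption. In the sketch below, k is the number of fitted coefficients, n the number of journeys, \hat{L} the maximised likelihood, and RSS the residual sum of squares; the second form holds for least squares with Gaussian errors.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
             = n \ln\!\left(\frac{\mathrm{RSS}}{n}\right) + 2k + \text{const}
```

    The model with the lower AIC is preferred, so an extra higher-order term only earns its place if it reduces n ln(RSS/n) by more than 2.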
     
  5. Jul 10, 2012 #4
    Thank you for your help and time Chiro and haruspex, I'll take your advice into consideration.
     
  6. Jul 10, 2012 #5

    Stephen Tashi

    Science Advisor

    The word "sufficient" has no precise definition in this context. If you are willing to say precisely what it means, you might get a specific answer.

    In facing real world problems, it's very common to hear people express their goals using words that have a connotation of goodness, respectability and scientific merit (such as "sufficient", "optimal", "best", "valid", "reliable"). These words express genuine human longings, but no specific mathematical requirements. It's usually a big mental struggle to define a real world problem as an actual mathematical problem. In statistical problems, people usually take a variety of different approaches, look at the answers and see "how they like them". I don't know whether this should be viewed as exposing normal human failings or as testing the robustness of various assumptions!
     
  7. Sep 3, 2012 #6
    Hi,

    I was hoping that somebody might be in a position to advise me on a problem that I have.

    I am recording the travel patterns of a car for 80 different families for 2 years. Specifically I am recording the following:

    1. Journey start times
    2. Journey end times
    3. Distances
    4. Number of journeys per day

    Is there a method by which I could investigate/prove that the distributions of these variables are representative of the general population?

    I would appreciate any comments or feedback.

    Look forward to hearing from you.

    Thanks

    John
     
  8. Sep 3, 2012 #7

    haruspex

    Science Advisor
    Homework Helper
    Gold Member
    2016 Award

    You could try reading http://en.wikipedia.org/wiki/Sample_size, but I suspect your main difficulty will be a sampling bias, i.e. the way you selected the 80 families makes them unlikely to be representative of the whole population.
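
    If a trusted reference data set exists (for example a national travel survey, which is an assumption here), one rough check is to compare each recorded variable against it with a two-sample Kolmogorov-Smirnov statistic. The sketch below uses made-up distances, and ks_statistic and ks_critical are hypothetical helper names. Note that passing such a check cannot prove representativeness; it only fails to detect a difference, and repeated journeys from the same family are not independent observations, which the test assumes.

```python
import bisect
import math


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    gap = 0.0
    # The gap between two step functions can only change at a data point.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def ks_critical(n, m, alpha=0.05):
    """Large-sample critical value; a statistic above it suggests a difference."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


# Made-up journey distances in miles, for illustration only.
study = [2.1, 3.4, 5.0, 7.2, 8.8, 12.5, 15.0, 21.3]
reference = [1.0, 2.5, 4.1, 6.0, 9.5, 11.0, 18.2, 25.0]

d = ks_statistic(study, reference)
threshold = ks_critical(len(study), len(reference))
print(f"D = {d:.3f}, critical value = {threshold:.3f}")
```

    With samples this small the large-sample critical value is only a rough guide; the point is the shape of the comparison, not the numbers.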
     
  9. Sep 4, 2012 #8

    Bacle2

    Science Advisor

    How can you tell from reading the post how the families were selected?
     
  10. Sep 4, 2012 #9

    Bacle2

    Science Advisor

    You may want to look into Factor Analysis too, re part of what Chiro said, which may allow you to reduce the number of factors that you are considering:

    http://en.wikipedia.org/wiki/Factor_analysis
     
  11. Sep 4, 2012 #10

    haruspex

    Science Advisor
    Homework Helper
    Gold Member
    2016 Award

    It was an example of bias that can occur. "may make" would have been less confusing.
     
  12. Sep 4, 2012 #11

    Bacle2

    Science Advisor

    Yes, sorry, I realized later that's what you meant; not too common-sensical on my part.
     