When is it ok to exclude data for a model to fit better?

  1. Hello people,
    I'm working on a research project involving the modelling of a black body's radiation spectrum, and near the limits of the spectrometer, the intensity goes to zero rather sharply.

    I cannot directly say it is error from the spectrometer, because I simply don't know. However, I know it isn't what the model should look like.

    Here is the dilemma: when I fit all 1100 points, I get an r^2 value of roughly 0.733, which is bad.
    Yet if I exclude about 5-10% of the data on each end before I do the regression, I get an r^2 around 0.995, which is ideal.

    So the question is: should I exclude them, and how much excluded data is reasonable?

    I will be presenting this at a university, and I am assuming I will be questioned on my methods here. So how do I justify this? Is there a rule of thumb for this sort of thing?
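    For concreteness, the two quantities being compared can be written down explicitly. Here is a minimal Python/NumPy sketch (the helper names are illustrative, not part of the original analysis) of the r^2 calculation and the symmetric end-trimming described above:

```python
import numpy as np

def r_squared(y, y_fit):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y, y_fit = np.asarray(y, float), np.asarray(y_fit, float)
    ss_res = np.sum((y - y_fit) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def trim_ends(x, y, fraction=0.075):
    """Drop `fraction` of the points from each end (symmetric, ends-only exclusion)."""
    n = len(y)
    k = int(fraction * n)
    return x[k:n - k], y[k:n - k]
```

    Reporting both numbers (the full-range r^2 and the trimmed one), together with the trimming fraction, keeps the comparison reproducible for anyone who questions the method.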

    Last edited: Sep 9, 2013
  3. SteamKing

    SteamKing 11,019
    Staff Emeritus
    Science Advisor
    Homework Helper

    Sometimes, you take too much data. You may not be able to explain why each suspect data point falls out of range. With 1100 data points, excluding ten percent still leaves approx. 1000 data points.

    Rather than submerge the project in a tidal wave of data, perhaps you should do a series of trial data collections, where you take data from the entire range of the spectrum, but do it in several independent series. In this fashion, if something hasn't warmed up or whatever, its effect can be easily separated out from the rest of the data.

    Without knowing more about your experiment, it is hard to offer further suggestions.
  4. Vanadium 50

    Vanadium 50 18,489
    Staff Emeritus
    Science Advisor
    Education Advisor

  5. Here is more on the experiment.
    Took a spectrometer to a light bulb and recorded the data.
    Fit the data to this function:
    $$S_\lambda = \frac{2\pi c^{2} h\,\alpha}{\lambda^{5}\left(e^{hc/\lambda k T}-1\right)} + \text{constant}$$

    Here are two plots illustrating the difference this makes.
    The data shown is raw; the only difference between the plots is whether data was excluded before fitting.
    Data points outside the blue dotted lines were not used in the fit.
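    Purely as an illustration (not part of the original analysis): that function can be fit with scipy.optimize.curve_fit, treating the scale α, the temperature T, and the additive constant as free parameters. The data file name and starting guesses below are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

# Physical constants (SI units)
h = 6.626e-34    # Planck constant, J s
c = 2.998e8      # speed of light, m/s
kB = 1.381e-23   # Boltzmann constant, J/K

def planck(lam, alpha, T, const):
    """S_lambda = 2*pi*c^2*h*alpha / (lam^5 * (exp(h*c/(lam*kB*T)) - 1)) + const,
    with lam in metres and T in kelvin; alpha and const absorb detector scale/offset."""
    return 2.0 * np.pi * c**2 * h * alpha / (lam**5 * np.expm1(h * c / (lam * kB * T))) + const

# Hypothetical file and column layout -- substitute your own spectrometer export:
# lam, intensity = np.loadtxt("spectrum.txt", unpack=True)   # lam in metres

# Fit the full range, then refit with ~7.5% trimmed from each end, and compare:
# p0 = (1e-9, 2800.0, 0.0)   # rough starting guesses (scale, filament temperature in K, offset); adjust to your units
# popt_full, _ = curve_fit(planck, lam, intensity, p0=p0, maxfev=20000)
# n = len(lam); k = int(0.075 * n)
# popt_trim, _ = curve_fit(planck, lam[k:n - k], intensity[k:n - k], p0=p0, maxfev=20000)
```

    The r_squared helper sketched under the first post can then be evaluated once per fit, on whichever wavelength range each fit actually used.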


    Last edited: Sep 9, 2013
  6. D H

    Staff: Mentor

    Throwing out data is a bit of an art. Throw out too much and you run the risk of doctoring the data, of seeing a conclusion that is not justified. Use all of it, and as you have seen, you run the risk of bad data corrupting your results.

    Given that double peak, coupled with the bad fit, it looks as if you are using the spectrometer beyond the range it should be used in. Have you looked at the spectrometer's specification manual?
  7. I assumed it wouldn't gather data beyond its usable range. But that's worth looking into. Thanks.
  8. You need to find out more about your spectrometer. Do you have recent calibration data? Take some shots of a calibrated lamp.
  9. SteamKing

    SteamKing 11,019
    Staff Emeritus
    Science Advisor
    Homework Helper

    Definitely falls into the category, 'Something's messed up with the measurement device.'

    It's also bad practice to think that you can rely on your instruments to follow good experimental procedure when you can't be bothered to. ('I assumed it wouldn't gather data beyond its usable range.')
  10. davenn

    davenn 4,368
    Science Advisor
    Gold Member
    2014 Award

    I wouldn't assume that, and although assumptions are bad in general, a safer assumption would be that any data it gave beyond its range would be erratic or imprecise. And, as DH said, this poor data could cloud or corrupt your results.

  11. I did not realize a spectrometer could give data outside its range. I have never worked with one before today.
    This is something I will be looking into.

    I would appreciate it if we could stick to discussing how to decide to throw out data.
  12. SteamKing

    SteamKing 11,019
    Staff Emeritus
    Science Advisor
    Homework Helper

    You've got a flaky instrument. IMO, you can't trust the readings beyond a wavelength of about 0.9 micrometers.
  13. D H

    Staff: Mentor

    That is a bad assumption. A spectrometer is not smart. The burden of using it correctly falls on the user, not the machine. This is very typical of lab equipment.

    That you made this assumption and that this is the first time you used a spectrometer opens the door to a new explanation of your results: Operator error.

    Yet another explanation is that the instructor intentionally specified the experiment in a way that would use the equipment outside of its spec range. This is a good teaching moment, after all.

    Edit: Spec range is short for specification range, the range over which the manufacturer guarantees results will be accurate.

    You may well have to throw out all of the data and start over. Make sure you read the operator's manual so you know how to use it correctly, read the spec sheet so you know the limitations of the machine, and calibrate if necessary (as described in the operator's manual).
    Last edited: Sep 10, 2013
  14. ZapperZ

    ZapperZ 30,743
    Staff Emeritus
    Science Advisor
    Education Advisor

    I'm going to approach this from the point of view of an experimentalist in the audience. I would like to see the full data first, and then see what you did in your analysis. I think this is typically a good approach.

    Assuming that you have no plausible explanation for the deviation, and assuming that you have no clue on an explanation to justify not counting those "naughty" data points, what you should do is

    (i) present the entire data and do a fit on the data.

    (ii) then show that if you do not include the rogue data points, it is a better fit.

    This way, you are being honest. You may tell them that, at this point, you do not have a good explanation on why there are data points that do not follow the description. This way, you have not thrown away any data points in your presentation, and you've given your audience the ability to decide for themselves if they want to accept those data points. But at the very least, you are not hiding from them things that you can't accommodate or explain.

    However, if you do have a plausible explanation, and I'm seeing several posts in this thread on what possibly could have happened, then you can use such an explanation to justify why you can ignore those data points, and produce a better fit with the remaining ones. In either case, always, ALWAYS show all of your data (unless you know for sure some of them are "garbage" because of obvious errors in measurements, etc.). In many areas of physics, even instrumental errors can be interesting and illuminating, especially in research on device physics.
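    As a purely illustrative sketch of that two-step presentation (matplotlib; the variable and function names are assumed from the fitting sketch earlier in the thread, not from the original posts): plot every raw point, mark the excluded ones differently, and overlay both fits so the audience can judge the exclusion for themselves.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_both_fits(lam, intensity, planck, popt_full, popt_trim, k):
    """lam/intensity: raw data sorted by wavelength; planck: the model function;
    popt_full/popt_trim: best-fit parameters; k: points trimmed from each end."""
    n = len(lam)
    kept = slice(k, n - k)

    fig, ax = plt.subplots()
    # Every raw point is shown: excluded points as open grey markers, kept points filled.
    ax.plot(lam[:k], intensity[:k], "o", mfc="none", color="grey", label="excluded from trimmed fit")
    ax.plot(lam[n - k:], intensity[n - k:], "o", mfc="none", color="grey")
    ax.plot(lam[kept], intensity[kept], ".", color="black", label="used in trimmed fit")

    # Both fitted curves, each drawn over the full wavelength range.
    grid = np.linspace(lam.min(), lam.max(), 500)
    ax.plot(grid, planck(grid, *popt_full), "-", label="fit to all points")
    ax.plot(grid, planck(grid, *popt_trim), "--", label="fit to trimmed points")

    ax.set_xlabel("wavelength (m)")
    ax.set_ylabel("intensity (arb. units)")
    ax.legend()
    return fig
```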

  15. How do you know that you are "near the limits"? If you don't have the manual, with a bit of luck you can find the specs on the Internet (and the manual, as DH remarked!). If parts of your measurements are outside the specified range (or otherwise known to be wrong), you can either drop those parts or, better, plot them dashed and explain that for the analysis you will ignore the measurement data that were out of spec. But if that doesn't help (or if you can't find the specs!), there remains the suggestion by ZapperZ.
  16. Sounds like a filter. The characteristics of the filter should be documented.

    You need to report all of the data. Failure to report all your data is a cardinal scientific sin, akin to plagiarism.

    However, the statistical methods you use are up to you. As long as you clearly explain your inclusion and exclusion criteria and motivation, then it should be fine. People can judge whether they like your exclusion criteria and your reasoning.

    Few people will like "We excluded the last N data points to make our fit better." More people will like "We excluded the last N data points to avoid bias from the spectrometer's XYZ filter."

    Check out the advice here, particularly the second paper.
    Last edited: Sep 10, 2013
  17. f95toli

    f95toli 2,473
    Science Advisor
    Gold Member

    I don't think there are any precise rules for when you can discard data. I also think people who say "always include all data" are a bit naive. In my experience (I am an experimental physicist), this hardly ever happens in real experiments. The reason is that if you are doing a "real" experiment (as opposed to an experiment at school), you end up spending a LOT of time tweaking instrumentation etc., meaning most runs will just produce sets of data where you, the experimentalist, know there is a problem or something is not optimal.
    Sometimes the problem is obvious (e.g. an amplifier saturated) and sometimes the problem is not so easy to find (e.g. a dodgy cable).

    It is when the problem is not so obvious that you have to use your judgement; and whenever you can, you should always specify the criteria/reason you used to discard points.

    Example: we are currently working on a paper where we are presenting some very tricky measurements. One of the graphs in there summarizes about 50 data points or so, and is the result of hundreds of hours of measurements (one dataset yields one point). Now, one difficulty with these measurements is that they rely on the temperature of the environment being very stable. Unfortunately our temperature control is not perfect, and now and then the temperature starts to drift, which gives a "false" reading (sometimes the curve looks funny, but occasionally it looks "real").
    If we were to include data points obtained while the temperature was drifting, the results in the paper would look quite different; fortunately we can post-process our data and see whether this has happened, in which case the data is discarded. Hence, in the end we've probably discarded something like 25% of our data, but the key point is that we explicitly state this in the paper, AND we specify the method we use to decide which points to discard.
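    That workflow amounts to an explicit, stated filter applied in post-processing. A hypothetical sketch of such a criterion (the record layout and the 0.01 K drift threshold are invented for illustration, not taken from the paper described above):

```python
import numpy as np

def temperature_was_stable(temperature_log, max_drift_K=0.01):
    """Stated exclusion criterion: keep a run only if the environment temperature
    drifted by less than max_drift_K over the course of the measurement."""
    return np.ptp(temperature_log) < max_drift_K

def filter_runs(runs, max_drift_K=0.01):
    """Each run is assumed to be a dict with a 'temperature_log' array recorded
    alongside the measurement. Returns the kept runs and reports the discarded count."""
    kept = [r for r in runs if temperature_was_stable(r["temperature_log"], max_drift_K)]
    print(f"Discarded {len(runs) - len(kept)} of {len(runs)} runs "
          f"(criterion: temperature drift < {max_drift_K} K).")
    return kept
```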
  18. russ_watters

    Staff: Mentor

    I'm not certain about spectrometers, but most equipment I've used has measurement accuracy expressed as a fraction of full scale, so as the readings get lower, the % error goes up.
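    To put illustrative numbers on that (not figures from this thread): an instrument rated at ±1% of a 1000-count full scale is good to ±10 counts anywhere in its range, which is ±1% at a reading of 1000 but ±20% at a reading of 50.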
  19. AlephZero

    AlephZero 7,248
    Science Advisor
    Homework Helper

    Yes .... but the key word in that sentence is data.

    If you used an instrument outside of its working range, or without calibration, or whatever, you don't have data. You have random numbers.

    If you take the "report everything" principle to extremes, you might end up publishing nonsense like this (which would be hilarious if it weren't so serious): http://www.bbc.co.uk/news/world-asia-23918882
    (I'm not sure about the words "more sensitive" in their explanation, but that's a different quibble)
  20. Oh no that's really bad! Indeed, it's very bad to report invalid data without warning.

    Two quibbles: invalid data is rarely "random" (as your example nicely illustrates), and indeed "more sensitive" is poorly worded.
  21. Another key word is "report". That doesn't mean that you need to analyze it, but you do need to report it (or at least report that you got it) and explain why you didn't analyze it, like f95toli said above.
    Last edited: Sep 10, 2013