When is it ok to exclude data for a model to fit better?

In summary, the conversation concerns a researcher modelling a black body's radiation spectrum with a spectrometer. Near the limits of the spectrometer the measured intensity drops sharply to zero, and the researcher cannot tell whether this is instrument error. Excluding some data points at each end improves the regression dramatically, but the researcher is unsure how much data it is reasonable to exclude and, since the results will be presented at a university, how to justify the decision. Suggestions include taking several independent series of data, thoroughly understanding the limitations (specification range) of the spectrometer, considering operator error, and ensuring proper calibration and transparent reporting of any excluded data.
  • #1
elegysix
Hello people,
I'm working on a research project involving the modelling of a black body's radiation spectrum, and near the limits of the spectrometer, the intensity goes to zero rather sharply.

I cannot directly say it is error from the spectrometer, because I simply don't know. However, I know it isn't what the model should look like.

Here is the dilemma: when I fit all 1100 points, I get an r^2 value of roughly 0.733, which is bad.
Yet if I exclude about 5-10% of the data on each end before I do the regression, I get an r^2 around 0.995, which is ideal.
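
Schematically, the comparison amounts to something like this (a minimal sketch with synthetic data and a stand-in model, just to illustrate what I mean by trimming before the regression; it is not my actual script):

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    # stand-in for the real spectral model, just to keep the sketch short
    return a * x + b

def r_squared(x, y):
    popt, _ = curve_fit(model, x, y)
    resid = y - model(x, *popt)
    return 1.0 - np.sum(resid**2) / np.sum((y - np.mean(y))**2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 1100)
y = model(x, 2.0, 1.0) + rng.normal(0.0, 0.02, x.size)
y[:55] = 0.0    # mimic the intensity dropping to zero at both ends
y[-55:] = 0.0

trim = int(0.05 * x.size)                          # 5% off each end
print(r_squared(x, y))                             # poor fit: the tails drag r^2 down
print(r_squared(x[trim:-trim], y[trim:-trim]))     # much better
```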

So the question is: should I exclude them, and how much excluded data is reasonable?

I will be presenting this at a university, and I am assuming I will be questioned on my methods here. So how do I reason this? Is there a rule of thumb for this sort of thing?

Thanks,
Eq
 
  • #2
Sometimes, you take too much data. You may not be able to explain why each suspect data point falls out of range. With 1100 data points, excluding ten percent still leaves approx. 1000 data points.

Rather than submerge the project in a tidal wave of data, perhaps you should do a series of trial data collections, where you take data from the entire range of the spectrum, but do it in several independent series. In this fashion, if something hasn't warmed up or whatever, its effect can be easily separated out from the rest of the data.

Without knowing more about your experiment, it is hard to offer further suggestions.
 
  • #4
Here is more on the experiment.
I pointed a spectrometer at a light bulb and recorded the data, then fit the data to this function:

$$S_\lambda = \frac{2\pi c^2 h \alpha}{\lambda^5\left(e^{hc/\lambda k T} - 1\right)} + \text{constant}$$

Here are two plots illustrating the difference this makes.
The data shown are raw; the difference between the two plots is whether data were excluded before the fitting.
Data points outside the blue dotted lines were not used when fitting.

[Attachment: example.png — fit using all of the data]

[Attachment: example2.png — fit excluding the data outside the blue dotted lines]
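
For reference, here is a minimal sketch of how this function can be fit (illustrative only; the file name, starting guesses, and the mask are placeholders, not my actual script):

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.constants import h, c, k  # Planck constant, speed of light, Boltzmann constant

def planck(lam, T, alpha, offset):
    """S(lam) = 2*pi*c^2*h*alpha / (lam^5 * (exp(h*c/(lam*k*T)) - 1)) + offset,
    with lam in metres and T in kelvin; alpha and offset absorb the detector scale."""
    return 2.0 * np.pi * c**2 * h * alpha / (lam**5 * np.expm1(h * c / (lam * k * T))) + offset

# lam, counts = np.loadtxt("spectrum.txt", unpack=True)      # placeholder file name
# mask = (lam > 450e-9) & (lam < 950e-9)                      # drop the suspect ends
# popt, pcov = curve_fit(planck, lam[mask], counts[mask], p0=(2800.0, 1e-12, 0.0))
```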
 
  • #5
Throwing out data is a bit of an art. Throw out too much and you run the risk of doctoring the data, of seeing a conclusion that is not justified. Use all of it, and as you have seen, you run the risk of bad data corrupting your results.

Given that double peak, coupled with the bad fit, it looks as if you are using the spectrometer beyond the range it should be used in. Have you looked at the spectrometer's specification manual?
 
  • #6
I assumed it wouldn't gather data beyond it's usable range. But that's worth looking into. thanks.
 
  • #7
You need to find out more about your spectrometer. Do you have recent calibration data? Take some shots of a calibrated lamp.
 
  • #8
Definitely falls into the category, 'Something's messed up with the measurement device.'

It's also bad practice to think that you can rely on your instruments to follow good experimental procedure when you can't be bothered to. ('I assumed it wouldn't gather data beyond it's (sic) usable range.')
 
  • #9
elegysix said:
I assumed it wouldn't gather data beyond it's usable range. But that's worth looking into. thanks.

I wouldn't assume that. Although assumptions are generally risky, a safer one is that any data the instrument gives beyond its range will be erratic or imprecise. And as DH said, this poor data could cloud or corrupt your results.

Dave
 
  • #10
I did not realize a spectrometer could give data outside its range. I have never worked with one before today.
This is something I will be looking into.

I would appreciate it if we could stick to discussing how to decide to throw out data.
 
  • #11
You've got a flaky instrument. IMO, you can't trust the readings beyond a wavelength of about 0.9 micrometers.
 
  • #12
elegysix said:
I did not realize a spectrometer could give data outside its range. I have never worked with one before today.
That is a bad assumption. A spectrometer is not smart. The burden of using it correctly falls on the user, not the machine. This is very typical of lab equipment.

That you made this assumption and that this is the first time you used a spectrometer opens the door to a new explanation of your results: Operator error.

Yet another explanation is that the instructor intentionally specified the experiment in a way that would use the equipment outside of its spec range. This is a good teaching moment, after all.

Edit: "Spec range" is short for specification range, the range over which the manufacturer guarantees results will be accurate.

elegysix said:
I would appreciate it if we could stick to discussing how to decide to throw out data.
You may well have to throw out all of the data and start over. Make sure you read the operator's manual so you know how to use the spectrometer correctly, read the spec sheet so you know the limitations of the machine, and calibrate if necessary (as described in the operator's manual).
 
  • #13
elegysix said:
Hello people,
I'm working on a research project involving the modelling of a black body's radiation spectrum, and near the limits of the spectrometer, the intensity goes to zero rather sharply.

I cannot directly say it is error from the spectrometer, because I simply don't know. However, I know it isn't what the model should look like.

Here is the dilemma: when I fit all 1100 points, I get an r^2 value of roughly 0.733, which is bad.
Yet if I exclude about 5-10% of the data on each end before I do the regression, I get an r^2 around 0.995, which is ideal.

So the question is: should I exclude them, and how much excluded data is reasonable?

I will be presenting this at a university, and I am assuming I will be questioned on my methods here. So how do I reason this? Is there a rule of thumb for this sort of thing?

Thanks,
Eq

I'm going to approach this from the point of view of an experimentalist in the audience. I would like to see the full data first, and then see what you did in your analysis. And I think this is typically a good approach.

Assuming that you have no plausible explanation for the deviation, and no justification for discounting those "naughty" data points, what you should do is:

(i) present the entire data set and do a fit on all of it;

(ii) then show that if you do not include the rogue data points, you get a better fit.

This way, you are being honest. You may tell them that, at this point, you do not have a good explanation for why some data points do not follow the model. This way, you have not thrown away any data points in your presentation, and you've given your audience the ability to decide for themselves whether they want to accept those data points. But at the very least, you are not hiding from them things that you can't accommodate or explain.

However, if you do have a plausible explanation, and I'm seeing several posts in this thread on what possibly could have happened, then you can use such an explanation to justify why you can ignore those data points, and produce a better fit with the remaining ones. In either case, always, ALWAYS show all of your data (unless you know for sure some of them are "garbage" because of obvious errors in measurements, etc.). In many areas of physics, even instrumental errors can be interesting and illuminating, especially in research on device physics.

Zz.
 
  • #14
elegysix said:
Hello people,
I'm working on a research project involving the modelling of a black body's radiation spectrum, and near the limits of the spectrometer, the intensity goes to zero rather sharply.
[..]
How do you know that you are "near the limits"? If you don't have the manual, with a bit of luck you can find the specs on the Internet (and the manual, as DH remarked!). If parts of your measurements are out of the specified range (or otherwise known to be wrong), you can either drop those parts or, better, keep them but mark them (for example with dashed lines) and explain that the analysis ignores the measurement data that were out of spec. But if that doesn't help (or if you can't find the specs!), then the suggestion by ZapperZ remains.
 
  • #15
elegysix said:
Hello people,
I'm working on a research project involving the modelling of a black body's radiation spectrum, and near the limits of the spectrometer, the intensity goes to zero rather sharply.
Sounds like a filter. The characteristics of the filter should be documented.

elegysix said:
So the question is: should I exclude them, and how much excluded data is reasonable?
You need to report all of the data. Failure to report all your data is a cardinal scientific sin, akin to plagiarism.

However, the statistical methods you use are up to you. As long as you clearly explain your inclusion and exclusion criteria and motivation, then it should be fine. People can judge whether they like your exclusion criteria and your reasoning.

Few people will like "We excluded the last N data points to make our fit better". More people will like "We excluded the last N data points to avoid bias from the spectrometer's XYZ filter".

Check out the advice here, particularly the second paper.
http://www.aai.org/about/Publications/Additional/Docs/AAI_Dos_Donts.pdf
 
  • #16
I don't think there are any precise rules for when you can discard data. I also think people who say "always include all data" are a bit naive. In my experience (I am an experimental physicist), this hardly ever happens in real experiments. The reason is that if you are doing a "real" experiment (as opposed to an experiment at school), you end up spending a LOT of time tweaking instrumentation etc., meaning most runs will just produce sets of data where you - the experimentalist - know there is a problem, or something is not optimal.
Sometimes the problem is obvious (e.g. an amplifier saturated) and sometimes the problem is not so easy to find (e.g. a dodgy cable).

It is when the problem is not so obvious that you have to use your judgement; and whenever you can, you should always specify the criteria/reason you used to discard points.

Example: we are currently working on a paper where we are presenting some very tricky measurements. One of the graphs in there summarizes about 50 data points or so, and is the result of hundreds of hours of measurements (one dataset results in one point). Now, one difficulty with these measurements is that they rely on the temperature of the environment being very stable. Unfortunately our temperature control is not perfect, and now and then the temperature starts to drift, which gives a "false" reading (sometimes the curve looks funny, but occasionally it looks "real").
If we were to include data points obtained while the temperature was drifting, the results in the paper would look quite different. Fortunately we can post-process our data to see whether this has happened, in which case the data point is discarded. Hence, in the end we've probably discarded something like 25% of our data, but the key point is that we explicitly state this in the paper, AND we specify the method we use to decide which points to discard.
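
Schematically, the criterion looks something like this (the names and the drift threshold are invented for illustration; our real post-processing is more involved):

```python
import numpy as np

def keep_run(temperature_log_K, max_drift_mK=5.0):
    """Accept a run only if the environment temperature stayed stable while
    the data point was being acquired; otherwise the point is discarded."""
    drift_mK = 1.0e3 * (np.max(temperature_log_K) - np.min(temperature_log_K))
    return drift_mK <= max_drift_mK

# runs = [...]                                     # each run: (data, temperature_log_K)
# kept = [data for data, temps in runs if keep_run(temps)]
```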
 
  • #17
I'm not certain about spectrometers, but most equipment I've used has its measurement accuracy expressed as a fraction of full scale, so as the readings get lower, the percentage error goes up.
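
A toy calculation makes the point (the ±1% figure and the 1000-count full scale are assumed numbers for illustration, not any particular spectrometer's spec):

```python
# Assume an instrument rated at +/-1% of full scale on a 1000-count range.
full_scale = 1000.0
accuracy_fraction = 0.01
absolute_error = accuracy_fraction * full_scale   # +/-10 counts, independent of the reading

for reading in (1000.0, 100.0, 10.0):
    relative_error_pct = 100.0 * absolute_error / reading
    print(f"reading {reading:6.1f}: +/-{relative_error_pct:.0f}% relative error")
```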
 
  • #18
DaleSpam said:
You need to report all of the data. Failure to report all your data is a cardinal scientific sin, akin to plagiarism.

Yes ... but the key word in that sentence is data.

If you used an instrument outside of its working range, or without calibration, or whatever, you don't have data. You have random numbers.

If you take the "report everything" principle to extremes, you might end up publishing nonsense like this (which would be hilarious if it weren't so serious): http://www.bbc.co.uk/news/world-asia-23918882
The Tokyo Electric Power Company (Tepco) had originally said the radiation emitted by the leaking water was around 100 millisieverts an hour.

However, the company said the equipment used to make that recording could only read measurements of up to 100 millisieverts.

The new recording, using a more sensitive device, showed a level of 1,800 millisieverts an hour.

(I'm not sure about the words "more sensitive" in their explanation, but that's a different quibble)
 
  • #19
AlephZero said:
Yes ... but the key word in that sentence is data.

If you used an instrument outside of its working range, or without calibration, or whatever, you don't have data. You have random numbers.

If you take the "report everything" principle to extremes, you might end up publishing nonsense like this (which would be hilarious if it weren't so serious): http://www.bbc.co.uk/news/world-asia-23918882


(I'm not sure about the words "more sensitive" in their explanation, but that's a different quibble)
Oh no, that's really bad! Indeed, it's very bad to report invalid data without warning.

Two quibbles: invalid data is rarely "random" (as your example nicely illustrates), and indeed "more sensitive" is poorly worded.
 
  • #20
AlephZero said:
Yes ... but the key word in that sentence is data.
Another key word is "report". That doesn't mean that you need to analyze it, but you do need to report it (or at least report that you got it) and explain why you didn't analyze it. Like f95toli said:
f95toli said:
in the end we've probably discarded something like 25% of our data, but the key point is that we explicitly state this in the paper, AND we specify the method we use to decide which points to discard.
 
  • #21
DaleSpam said:
Another key word is "report". That doesn't mean that you need to analyze it, but you do need to report it (or at least report that you got it) and explain why you didn't analyze it. Like f95toli said:
Not to be pedantic, but please define "report". I would state what data is reported and why, but I wouldn't show data outside that range when the data is (or should be) known to be bad before it is even recorded. The logic being that data known to be invalid before it is recorded should never be recorded in the first place.
 
  • #22
russ_watters said:
Not to be pedantic, but please define "report". I would state what data is reported and why, but I wouldn't show data outside that range when the data is (or should be) known to be bad before it is even recorded. The logic being that data known to be invalid before it is recorded should never be recorded in the first place.
No worries, it is always reasonable to ask for clarification of terms.

By "report" I just mean a sentence in the manuscript that mentions the existence of any data collected, including data that was excluded from analysis. Reporting need not include any plots or summary statistics or other details or analysis, but all data should be mentioned in the report, that is what I mean by "report".
 
  • #23
DaleSpam said:
By "report" I just mean a sentence in the manuscript that mentions the existence of any data collected, including data that was excluded from analysis. Reporting need not include any plots or summary statistics or other details or analysis, but all data should be mentioned in the report, that is what I mean by "report".
We're in agreement.

I'll still explain my personal example of what I mean though:

I'm an HVAC engineer and often deal with databases of recorded control/sensor points. For any particular facility (campus) you may have thousands of recorded points, recorded as often as every second, for the life of the facility. It is a lot of data (for me to analyze with Excel and for them to store and collect). Regardless of what a particular sensor is capable of, they are all set up to record only what is useful. If a sensor is capable of 16-bit precision but 8 bits are all that is needed, there is no point in recording all 16. If a fan is capable of 10,000 CFM of airflow and stalls below 5,000 (and the stability is only ±10 CFM), there is no point in setting the recording range equal to the sensor's readout of 0-65,000 ±1.

Readings that are known to be nonsense are pre-filtered to save database space.
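
A toy illustration of that pre-filtering (the limits and the storage list are made up for illustration; real building-automation systems do this at the controller or historian):

```python
recorded_points = []

def record(value_cfm, low=0.0, high=10_000.0):
    """Store a reading only if it falls inside the physically meaningful range;
    out-of-range readings are never written, so they never need excluding later."""
    if low <= value_cfm <= high:
        recorded_points.append(value_cfm)

record(6_500.0)     # kept
record(64_000.0)    # silently dropped: beyond the fan's 10,000 CFM capability
```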

Now, I'm an engineer, not a scientist, so I don't typically report this filtering (I'm not writing papers), but I am always prepared to defend it if necessary. I deal with proving energy savings to the government, and while there may be hundreds of thousands of dollars of incentive money resting on one calculation or measurement, the depth of the review process is at the discretion of the multiple layers of auditors who check it. The government sets the reporting requirements there.
 
  • #24
elegysix said:
Here is more on the experiment.
I pointed a spectrometer at a light bulb and recorded the data, then fit the data to this function:

$$S_\lambda = \frac{2\pi c^2 h \alpha}{\lambda^5\left(e^{hc/\lambda k T} - 1\right)} + \text{constant}$$

You said the source was a light bulb. The bulb was made from ordinary glass, I guess, instead of quartz glass. The light bulb certainly absorbed UV radiation, so you have to discard the UV data.
Also, atmospheric gases - water vapour and ozone - absorb radiation in the visible. The dips in your spectra come from that absorption.
The detector of the spectrometer also has a sensitivity that depends on the wavelength. See the response of a silicon detector:

http://upload.wikimedia.org/wikipedia/commons/4/41/Response_silicon_photodiode.svg

So the recorded data are the true intensity multiplied by the sensitivity of the detector.
You have to perform a calibration - record a spectrum with a standard source and correct your measured data accordingly. I suggest using data measured in the range 400-1000 nm.
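
A sketch of such a response correction (the function names are illustrative; how your software actually applies a stored calibration may differ):

```python
import numpy as np

def instrument_response(measured_ref, true_ref):
    """Sensitivity of the whole instrument (detector + optics), estimated by
    measuring a standard lamp whose true spectrum true_ref is known."""
    return measured_ref / true_ref

def correct_spectrum(lam, raw_counts, lam_ref, response):
    """Divide the raw counts by the instrument response, interpolated onto the
    measurement wavelengths, to recover the source spectrum."""
    return raw_counts / np.interp(lam, lam_ref, response)
```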

When showing measured data together with the fit, show only the range where you performed fitting.

ehild
 
  • #25
ehild said:
So the recorded data are the true intensity multiplied by the sensitivity of the detector.
You have to perform a calibration - record a spectrum with a standard source and correct your measured data accordingly. I suggest using data measured in the range 400-1000 nm.

If memory serves me correctly, the spectrometer is a QE65PRO from Ocean Optics; it was purchased last month and calibrated by them (they state they do that), and the calibration data is stored and processed by the spectrometer. So, if I happen to be extremely fortunate, I may not have to.
Regardless, I will at least check to make sure. I probably will end up doing a calibration like that.
I am also planning to exclude small ranges corresponding to absorption lines, like you mentioned (I'm not that far into it yet).

From what I've gathered on here and from OO's website, the usable range of wavelengths depends on the grating in the spectrometer and the slit diameter. I don't know how to find these two things out (without taking the spectrometer apart), so I'll be calling them tomorrow or Friday to ask what it was shipped with.
 
  • #26
The range of the detector of your spectrometer is 200-1100 nm. Basically the spectrometer is used to scan transmittance/absorbance, and this one also measures Raman and fluorescence spectra. As you measured the emission of a light bulb, you used the spectrometer in fluorescence mode, I guess.
When you measure with a spectrometer, you always have to scan a baseline first and then correct the raw data with it. Read the spectrometer's handbook to learn how to perform the measurement.

ehild
 

1. How do I determine if excluding data is necessary for my model to fit better?

The decision to exclude data should be based on the goal of the study and the specific research question, not on the desire for a better-looking fit. Carefully consider the potential impact of excluding data on the validity and generalizability of the results, and prefer exclusion criteria that can be stated before the analysis, such as points recorded outside the instrument's specified range.

2. Is it ever acceptable to exclude data for convenience or to support a desired outcome?

No, it is not acceptable to exclude data for convenience or to support a desired outcome. This goes against the principles of scientific integrity and can lead to biased or unreliable results.

3. Are there any ethical considerations when it comes to excluding data for model fitting?

Yes, there are ethical considerations when it comes to excluding data for model fitting. Exclusions must not be used to steer the results toward a desired conclusion, and every exclusion should be documented and justified so that readers can judge the data for themselves.

4. Can excluding data improve the accuracy of a model?

In some cases, excluding data may improve the accuracy of a model. However, this should only be done after careful consideration and validation of the reasons for excluding the data.

5. How should I document and report any excluded data in my research?

It is important to clearly document and report any excluded data in your research. This includes providing a clear rationale for why the data was excluded, as well as any potential impact on the results and conclusions of the study.
