When is it ok to exclude data for a model to fit better?

harrylin
#19
Sep10-13, 01:05 PM
P: 3,187
Quote Quote by AlephZero View Post
Yes .... but the key word in that sentence is data.

If you used an instrument outside of its working range, or without calibration, or whatever, you don't have data. You have random numbers.

If you take the "report everything" principle to extremes, you might end up publishing nonsense like this (which would be hilarious if it weren't so serious): http://www.bbc.co.uk/news/world-asia-23918882


(I'm not sure about the words "more sensitive" in their explanation, but that's a different quibble)
Oh no that's really bad! Indeed, it's very bad to report invalid data without warning.

Two quibbles: invalid data is rarely "random" (as your example nicely illustrates), and indeed "more sensitive" is poorly worded.
DaleSpam
#20
Sep10-13, 01:26 PM
Mentor
P: 17,330
Quote Quote by AlephZero View Post
Yes .... but the key word in that sentence is data.
Another key word is "report". That doesn't mean that you need to analyze it, but you do need to report it (or at least report that you got it) and explain why you didn't analyze it. Like f95toli said:
Quote Quote by f95toli View Post
In the end we've probably discarded something like 25% of our data, but the key point is that in the paper we explicitly state this, AND we specify the method we used to decide which points to discard.
russ_watters
#21
Sep10-13, 06:58 PM
Mentor
P: 22,302
Quote Quote by DaleSpam View Post
Another key word is "report". That doesn't mean that you need to analyze it, but you do need to report it (or at least report that you got it) and explain why you didn't analyze it. Like f95toli said:
Not to be pedantic, but please define "report". I would state what data is reported and why, but I wouldn't show data outside of that range when it (should be) known to be bad before it is recorded. The logic: data known to be invalid before it is recorded should never be recorded in the first place.
DaleSpam
#22
Sep10-13, 07:05 PM
Mentor
P: 17,330
Quote Quote by russ_watters View Post
Not to be pedantic, but please define "report". I would state what data is reported and why, but I wouldn't show data outside of that range when it (should be) known to be bad before it is recorded. The logic: data known to be invalid before it is recorded should never be recorded in the first place.
No worries, it is always reasonable to ask for clarification of terms.

By "report" I just mean a sentence in the manuscript that mentions the existence of any data collected, including data that was excluded from analysis. Reporting need not include plots, summary statistics, or other details or analysis, but all collected data should at least be mentioned.
russ_watters
#23
Sep10-13, 07:25 PM
Mentor
P: 22,302
Quote Quote by DaleSpam View Post
By "report" I just mean a sentence in the manuscript that mentions the existence of any data collected, including data that was excluded from analysis. Reporting need not include plots, summary statistics, or other details or analysis, but all collected data should at least be mentioned.
We're in agreement.

I'll still explain my personal example of what I mean though:

I'm an HVAC engineer and often deal with databases of recorded control/sensor points. For any particular facility (campus) you may have thousands of recorded points, recorded as often as every second, for the life of the facility. It is a lot of data (for me to analyze with Excel, and for them to collect and store). Regardless of what a particular sensor is capable of, they are all set up to record what is useful. If a sensor is capable of 16-bit precision but 8-bit is all that is needed, there is no point in recording 16 bits. If a fan is capable of 10,000 CFM of airflow and stalls below 5,000 (and the stability is only +-10 CFM), there is no point in setting the recording range equal to the sensor's readout of 0-65,000 +-1.

Readings that are known to be nonsense are pre-filtered to save database space.
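As an illustration (an editorial sketch, not from the original post; the 5,000-10,000 CFM limits just echo the fan example above), such a pre-filter can be as simple as:

```python
# Hypothetical pre-filter: keep only readings inside the range where the
# fan/sensor combination is known to produce meaningful values.
# The limits below are illustrative, taken from the fan example above.
VALID_MIN_CFM = 5_000   # fan stalls below this airflow
VALID_MAX_CFM = 10_000  # fan's maximum capacity

def prefilter(readings):
    """Drop readings outside the physically meaningful range before storage."""
    return [r for r in readings if VALID_MIN_CFM <= r <= VALID_MAX_CFM]

raw = [4_200, 7_500, 9_900, 12_000, 6_100]
print(prefilter(raw))  # -> [7500, 9900, 6100]
```

In practice a filter like this runs before the database write, so values known in advance to be nonsense are never stored at all.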

Now, I'm an engineer, not a scientist, so I don't typically report this filtering (I'm not writing papers), but I am always prepared to defend it if necessary. I deal with proof of energy savings to the government, and while there may be hundreds of thousands of dollars of incentive money resting on one calculation/measurement, the depth of the review process is at the discretion of the multiple layers of auditors who check it. The government sets the reporting requirements there.
ehild
#24
Sep11-13, 01:31 AM
HW Helper
Thanks
P: 10,666
Quote Quote by elegysix View Post
Here is more on the experiment.
Took a spectrometer to a light bulb and recorded the data.
Fit the data to this function:
S_λ = 2πhc^2·α / [λ^5 (e^(hc/λkT) - 1)] + constant
You said the source was a light bulb. The bulb is made from ordinary glass, I guess, rather than quartz glass, so it certainly absorbed the UV radiation. You therefore have to discard the UV data.
Also, atmospheric gases (water vapour, ozone) absorb radiation in the visible; the dips in your spectra are from that absorption.
The detector of the spectrometer also has a sensitivity that depends on the wavelength; see the response curve of a silicon detector.

So the recorded data are the product of the true intensity and the sensitivity of the detector.
You have to perform a calibration: record a spectrum with a standard source, and correct your measured data accordingly. I suggest using data measured in the range 400-1000 nm.
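As a sketch of that correction (an editorial example; the function name and toy numbers are hypothetical, not from the post): estimate the per-wavelength sensitivity from the standard source, then divide it out of the measurement.

```python
def calibrate(measured, reference_measured, reference_true):
    """Divide out the detector's per-wavelength sensitivity.

    All three arguments are per-wavelength lists: the spectrum you measured,
    the spectrum recorded from a standard source, and the known true
    spectrum of that standard source.
    """
    sensitivity = [m / t for m, t in zip(reference_measured, reference_true)]
    return [m / s for m, s in zip(measured, sensitivity)]

# Toy numbers: the detector is half as sensitive at the last wavelength,
# so the corrected value there doubles.
print(calibrate([10.0, 20.0, 5.0], [8.0, 8.0, 4.0], [8.0, 8.0, 8.0]))
# -> [10.0, 20.0, 10.0]
```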

When showing measured data together with the fit, show only the range where you performed fitting.
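A minimal sketch of the model and the restricted fit window (an editorial example; the constants are standard SI values and the 400-1000 nm cut follows the suggestion above):

```python
import math

H = 6.62607015e-34  # Planck constant, J*s
C = 2.99792458e8    # speed of light, m/s
K = 1.380649e-23    # Boltzmann constant, J/K

def model(lam_nm, T, alpha, const):
    """S_lambda = 2*pi*h*c^2*alpha / (lam^5 * (exp(h*c/(lam*k*T)) - 1)) + const."""
    lam = lam_nm * 1e-9  # nm -> m
    return (2 * math.pi * H * C**2 * alpha
            / (lam**5 * math.expm1(H * C / (lam * K * T)))) + const

# Keep only points in the trusted window before fitting (and plotting).
wavelengths = [350, 450, 650, 900, 1050]
fit_points = [w for w in wavelengths if 400 <= w <= 1000]
print(fit_points)  # -> [450, 650, 900]
```

A least-squares fitter would then see only the 400-1000 nm points, so the plotted fit covers exactly the range that was actually fitted.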

ehild
elegysix
#25
Sep11-13, 09:43 PM
P: 316
Quote Quote by ehild View Post
So the recorded data are the product of the true intensity and the sensitivity of the detector.
You have to perform a calibration: record a spectrum with a standard source, and correct your measured data accordingly. I suggest using data measured in the range 400-1000 nm.
If memory serves me correctly, the spectrometer is a QE65PRO from Ocean Optics; it was purchased last month and calibrated by them (they state that they do this), and the calibration data is stored and processed by the spectrometer. So, if I'm extremely fortunate, I may not have to.
Regardless, I will at least check to make sure. I probably will end up doing a calibration like that.
I am also planning to exclude small ranges corresponding to absorption lines, like you mentioned (not that far into it yet).

From what I've gathered on here and from OO's website, the usable range of wavelengths depends on the grating in the spectrometer and the slit width. I don't know how to find these two things out (without taking the spectrometer apart), so I'll be calling them tomorrow or Friday to ask what it was shipped with.
ehild
#26
Sep12-13, 12:12 AM
HW Helper
Thanks
P: 10,666
The range of the detector of your spectrometer is 200-1100 nm. Basically, the spectrometer is used to scan transmittance/absorbance, and this one also measures Raman and fluorescence spectra. As you measured the emission of a light bulb, you used the spectrometer in fluorescence mode, I guess.
When you measure with a spectrometer, you always have to scan a baseline first, and then correct the raw data with it. Read the spectrometer's handbook for how to perform the measurement.
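As a sketch (an editorial example; whether the baseline is subtracted or divided out depends on the measurement mode, so check the handbook), a simple point-by-point subtraction looks like:

```python
def baseline_correct(raw, baseline):
    """Subtract a previously scanned baseline from the raw counts, point by point."""
    if len(raw) != len(baseline):
        raise ValueError("raw and baseline must cover the same wavelengths")
    return [r - b for r, b in zip(raw, baseline)]

raw_counts = [120.0, 340.0, 90.0]
base_scan = [20.0, 15.0, 25.0]
print(baseline_correct(raw_counts, base_scan))  # -> [100.0, 325.0, 65.0]
```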

ehild

