Sensitivity analysis, missing data, and hypothesis testing


Discussion Overview

The discussion revolves around the challenges of hypothesis testing with a dataset in which fewer than 10% of the response values are missing, focusing on the implications of that missing data for an analysis of survival time of land parcels. The participants explore methods for handling the missing data in the context of fitting a hazard model and testing hypotheses about the predictor variables.

Discussion Character

  • Exploratory, Technical explanation, Debate/contested

Main Points Raised

  • One participant describes a method for assigning values to missing data in order to minimize the chance of rejecting a null hypothesis related to the slope coefficient of topographic slope.
  • Another participant acknowledges the non-standard nature of this method but does not identify any logical pitfalls, suggesting that multivariate analysis could be approached either separately for each parameter or through a joint test.
  • A suggestion is made to consider using a censored data model, such as a Tobit model, as an alternative approach.
  • One participant inquires about the observed characteristics of the missing observations in relation to the sample mean, indicating a potential avenue for further exploration.

Areas of Agreement / Disagreement

Participants do not reach a consensus on the best approach to handle the missing data, with multiple competing views and methods being proposed. The discussion remains unresolved regarding the optimal strategy for hypothesis testing in this context.

Contextual Notes

The discussion highlights the complexity introduced by multivariate models and the potential implications of the missing data mechanism, which is suggested to be not missing at random. There are also considerations regarding the assumptions made when assigning values to missing data.

wvguy8258
Hi,

I have a large data set with fewer than 10% missing values (missing response, but all predictor variables present). It is a near certainty that these values are not missing at random; the probability of a value being missing depends on the value itself. The response is survival time of a land parcel, with 'death' being development of the parcel. Predictors are things like average topographic slope in the parcel, etc. I plan to fit a hazard model to the data to test hypotheses related to the sign and magnitude of slope coefficients. I've read a bit about methods for dealing with missing data, but I feel that, because I am primarily interested in testing hypotheses, a simpler method may be available that I haven't yet seen in print. I am here asking for advice on the feasibility of the simple idea that follows, how it can be improved, and whether anyone has pertinent references to share.

The survival time is bounded. I am taking the beginning of colonization of the area as the start of the study period and the present as its end, so the response variable is bounded between zero and 2009 minus the time of first colonization. Say I have a very simple hypothesis that the slope coefficient of topographic slope is less than zero, so my null hypothesis is that it is greater than or equal to zero. It seems that I could pick values for the missing data so as to minimize the chance of rejecting this null hypothesis. If I still find evidence to reject the null under this extreme assignment, then it is reasonable to conclude that the full data set, had the missing values also been observed, would likewise lead to rejection. So, in the example of topographic slope, I would assign the missing data values that give the largest possible slope coefficient (and the smallest variance at a high parameter estimate? I'm less sure how to think here) given the observed data.

First, are there any logical pitfalls I am falling into here? This seems rather straightforward with only one predictor in the model, but I suspect that a multivariate model will complicate things. Should each hypothesis considered (corresponding to each slope coefficient of interest) be treated separately? That is, should I concoct a set of missing values to try not to reject the null associated with hypothesis 1, then start over and do the same for hypothesis 2, and so on? Or should this be done all at once? If you think this is all a bad idea, in a few words, how would you go about modeling the data I've described? Thanks. -seth
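As a concrete illustration of the bounding idea, here is a minimal sketch that assumes the hazard model is a Cox proportional-hazards fit via the Python lifelines package. The column names (duration, event, topo_slope), the fill-in values, the treatment of filled parcels as 'developed', and the upper bound are all placeholder assumptions, and only the two corner fill-in patterns are checked rather than a full search over per-parcel assignments.

```python
# Minimal sketch of the bounding idea: refit under extreme fill-ins of the
# missing survival times and keep the least favourable (largest) p-value.
# Assumes a pandas DataFrame `df` with placeholder columns 'duration'
# (NaN where the response is missing), 'event' (1 = parcel developed),
# and the predictor 'topo_slope'; fits a Cox model with lifelines.
import numpy as np
import pandas as pd
from scipy.stats import norm
from lifelines import CoxPHFitter

def one_sided_p(data, coef="topo_slope"):
    """One-sided p-value for H0: beta >= 0 vs H1: beta < 0."""
    cph = CoxPHFitter()
    cph.fit(data, duration_col="duration", event_col="event")
    return norm.cdf(cph.summary.loc[coef, "z"])

def worst_case_p(df, upper_bound):
    """Try a few extreme assignments for the missing responses and return
    the largest resulting p-value.  A full worst case would search over
    per-parcel assignments; this sketch only checks the corner patterns."""
    missing = df["duration"].isna()
    p_values = []
    for fill in (1e-6, upper_bound):          # survival times must be positive
        trial = df.copy()
        trial.loc[missing, "duration"] = fill
        trial.loc[missing, "event"] = 1       # assumption: treat filled parcels as developed
        p_values.append(one_sided_p(trial))
    return max(p_values)

# Example call with a hypothetical bound (2009 minus earliest colonization year):
# p_bound = worst_case_p(df, upper_bound=2009 - 1850)
```

If the bound on the p-value is still below the chosen significance level, the rejection is robust to how the missing responses might have come out, which is the conclusion the post is after.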
 
Yours is a non-standard method, but I don't see a logical pitfall. For the multivariate analysis, my guess is you can do it separately for each parameter you are testing; or you can set up a joint test (e.g. an F test) that encompasses all of the individual tests, and then assign values so as to minimize that single, overarching test statistic.
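If the joint-test route is taken, one concrete form is a Wald chi-square statistic built from the fitted coefficient vector and its covariance matrix. The sketch below is generic; the numbers in the example are hypothetical and stand in for whatever the fitted hazard model reports.

```python
# Joint Wald test over several slope coefficients, as an alternative to
# testing each one separately.  `beta_hat` and `cov_hat` are placeholders
# for the estimates and covariance matrix reported by the fitted model.
import numpy as np
from scipy.stats import chi2

def joint_wald_test(beta_hat, cov_hat):
    """Chi-square statistic and p-value for H0: all tested coefficients = 0."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    stat = float(beta_hat @ np.linalg.solve(np.asarray(cov_hat, float), beta_hat))
    return stat, chi2.sf(stat, beta_hat.size)

# Example with hypothetical numbers:
# stat, p = joint_wald_test([-0.8, 0.1], [[0.04, 0.00], [0.00, 0.09]])
```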

Another approach may be to use a censored data ("Tobit") model.
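One way to read the censored-data suggestion for this problem is to treat parcels that are still undeveloped at the end of the window as right-censored at the bound. Below is a minimal log-likelihood sketch of such a model written from scratch with scipy, using placeholder arrays X, y, and censored; a real analysis would more likely use a purpose-built survival or Tobit routine.

```python
# Minimal sketch of a censored ("Tobit"-style) regression: responses that
# exceed the observation window are treated as right-censored at the bound.
# X (design matrix with intercept column), y (survival time, or the bound
# for censored parcels), and censored (boolean array) are placeholders.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def negloglik(params, X, y, censored):
    """Negative log-likelihood of a normal linear model with right-censoring.
    params = [beta_0, ..., beta_k, log_sigma]."""
    beta, sigma = params[:-1], np.exp(params[-1])
    z = (y - X @ beta) / sigma
    ll_obs = norm.logpdf(z[~censored]) - np.log(sigma)   # fully observed y
    ll_cen = norm.logsf(z[censored])                      # only y > bound known
    return -(ll_obs.sum() + ll_cen.sum())

def fit_tobit(X, y, censored):
    start = np.zeros(X.shape[1] + 1)                      # betas = 0, log_sigma = 0
    res = minimize(negloglik, start, args=(X, y, censored), method="BFGS")
    return res.x[:-1], np.exp(res.x[-1])                  # beta estimates, sigma

# Hypothetical usage:
# beta, sigma = fit_tobit(X, y, censored)
```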
 
Thank you, I have been reading a bit about models for censored data.
 
When you look at the observed characteristics of the "missing" observations, do you see a marked difference from the sample mean?
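A quick way to check this is to compare the predictor distributions for parcels with and without an observed response, for example group means plus a Welch t-test per predictor. The sketch below reuses the hypothetical data frame and column names from the earlier example.

```python
# Quick diagnostic: compare predictor summaries for parcels with and
# without an observed response.  `df`, 'duration', and 'topo_slope' are
# the placeholder names used in the earlier sketch.
from scipy.stats import ttest_ind

predictors = ["topo_slope"]                      # extend with the other predictors
missing = df["duration"].isna()

print(df.groupby(missing)[predictors].mean())    # group means: observed vs missing

for col in predictors:
    t, p = ttest_ind(df.loc[missing, col], df.loc[~missing, col], equal_var=False)
    print(col, round(t, 2), round(p, 3))         # Welch t-test per predictor
```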
 
