Analyzing Curvature with Confidence Bands?

1. Jul 14, 2017

FallenApple

So let's say that there are several different treatments for being underweight. One way to study this is to make repeated measurements of each patient's weight across time. I can then fit a model to the panel data, regressing weight on time and on treatment as an indicator variable, and plot the different fitted mean trajectories along with their pointwise confidence bands: basically a plot of response vs. time for all groups, but accounting for the uncertainty of the paths as well. Also, say that a scatterplot suggests the curves might not be linear, so I have to include some sort of polynomial term on the right-hand side of the equation.

Moreover, say that I notice that the bands are convex for some treatment groups and not for others. Can I interpret this as meaning that some of the treatments have accelerating returns and some have diminishing returns?

I'm thinking that even if a confidence band for a particular trajectory is convex, the true path could still be concave and yet lie within the band, if the band is particularly wide. And even if a convex-shaped band were narrow, the true path might still be concave over short intervals of time.
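This worry is easy to confirm numerically. A small sketch (in Python, with invented numbers): a pointwise band whose upper and lower limits are both convex can still contain a strictly concave path if it is wide enough.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 101)

# a convex pointwise band: both limits are convex functions of t
lower = t**2
upper = t**2 + 0.6

# a strictly concave candidate path (negative second derivative)
path = 0.4 + 0.9 * t - 0.2 * t**2

# the band is wide enough that the concave path lies entirely inside it
inside = bool(np.all((path >= lower) & (path <= upper)))
```

So a convex band alone does not pin down the sign of the curvature of the true trajectory.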

2. Jul 14, 2017

Staff: Mentor

Do you have any theoretical reason to expect a quadratic response? It is usually a bad idea to fit a complicated model without a theoretical motivation.

3. Jul 14, 2017

andrewkirk

If I understand you correctly, you are doing an OLS regression of weight (a continuous variable) against time (a continuous variable) and $n-1$ logical variables, being flags to indicate which treatment is being used, where there are $n$ treatments and one is chosen as the base. Write $W_t^j$ for the weight at time $t$ of patient $j$, and let $R_k^j$ be true iff patient $j$ is receiving treatment $k$. Let $\vec R^j$ be the vector with components $R_1^j,...,R_{n-1}^j$ that specifies which treatment patient $j$ is receiving.

If the relationship of weight to time appears significantly non-linear, any confidence bands for $E\left[W_t^j\ |\ \vec R^j=\vec r\right]$ generated by basic OLS regression will be of little use.

A way to remedy this is to fit a reasonably nice non-linear function $f_k$ to the map
$$t\mapsto \hat W_{t,k}\triangleq E\left[W_t^j\ |\ \vec R^j=\vec r_k\right]$$
for each $k\in\{0,...,n-1\}$, where $\vec r_k$ is the vector of flags in which only the $k$th flag is True (this identifies Treatment $k$), for $k>0$, and $\vec r_0=\vec 0$ is the vector in which all flags are False (this identifies the Base treatment).

A common way to fit the functions $f_k$ is with cubic splines.

We then regress $W_t^j$ against $\vec R^j$ and $g(t,j)$, where $g(t,j)=f_{K(j)}(t)$ and $K(j)$ is the treatment being applied to patient $j$, with $K(j)=0$ denoting the Base treatment.

If the splining process has succeeded, this regression will give reasonably linear-looking maps

$$f_k(t) \mapsto \hat W_{t,k}$$

and the confidence intervals should be meaningful.
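To make the two-step idea concrete, here is a rough sketch in Python rather than the R used elsewhere in the thread, with simulated data and a cubic polynomial standing in for a proper cubic spline (all numbers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# simulated panel: 12 time points, two treatment groups, two patients each
t = np.tile(np.arange(1.0, 13.0), 4)
group = np.repeat([0, 1], 24)
true_mean = np.where(group == 0, 50 + 2 * np.sqrt(t), 50 + 0.1 * t**2)
w = true_mean + rng.normal(0.0, 0.5, size=t.size)

# Step 1: fit a smooth f_k per group (cubic polynomial as a stand-in
# for a cubic spline)
f = {k: np.polynomial.Polynomial.fit(t[group == k], w[group == k], 3)
     for k in (0, 1)}

# Step 2: regress w on the treatment flag and g(t, j) = f_{K(j)}(t)
g = np.where(group == 0, f[0](t), f[1](t))
X = np.column_stack([np.ones_like(g), group, g])
beta, *_ = np.linalg.lstsq(X, w, rcond=None)
fitted = X @ beta
```

Because the per-group fits $f_k$ are themselves least-squares fits that include a constant, the coefficient on $g$ comes out as essentially 1: after the transformation, the response is linear in the new regressor, which is the point of the linearization.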

EDIT: But having posted that (it took a while to write the latex), I agree with Dale's much more succinct response, which has come in while I was writing this.

4. Jul 16, 2017

FallenApple

Not particularly. So I guess that if there is no theoretical reason, we would just use a linear fit, for parsimony and out of concern for overfitting? What about what a scatterplot shows beforehand?

Also, what if I changed the question and were interested in the serum hormone level in the bloodstream over time? Then we know that the level goes up during treatment and comes down after stopping treatment, so the spike and decrease could be modeled by a combination of linear splines or by a concave quadratic.

Or say I am doing a study on THC levels in cannabis smokers over time (hypothetically; I'm not sure how ethical or common it is to get healthy people to volunteer for something like this). They start smoking and then quit during the middle of the study. In that case we expect, from theory, that THC would stay in the system and then go down over time. Would we fit a concave quadratic curve here? In this case we wouldn't expect to overfit, since future values would presumably follow the same trend based on scientific theory.
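For the rise-then-fall shape described above, one simple option is a linear spline with a knot at the time treatment (or smoking) stops. A hedged sketch with simulated data, assuming the knot location is known (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(2)

t = np.arange(0.0, 20.0)
stop = 10.0                                    # treatment stops here

# true mean: rises at slope 2, then falls at slope 1.5 after the knot
mu = np.where(t < stop, 2 * t, 2 * stop - 1.5 * (t - stop))
y = mu + rng.normal(0.0, 0.5, t.size)

# linear-spline basis: intercept, t, and the hinge term (t - stop)_+
X = np.column_stack([np.ones_like(t), t, np.maximum(t - stop, 0.0)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# beta[1] estimates the rising slope (about 2); beta[1] + beta[2]
# estimates the falling slope (about -1.5)
```

The hinge coefficient `beta[2]` is the change in slope at the knot, so its sign and size directly describe the spike-and-decline shape.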

Last edited: Jul 16, 2017
5. Jul 16, 2017

FallenApple

So, from theory, I can determine whether or not to include a curve. Say I fit a curve or splines so that the model is not linear. Can I just obtain the estimated variance-covariance matrix of the model (vcov(model) in R), and from that build up the confidence bands by looping across the time points, doing this for each of the treatment groups? That way I get to see the type of curvature manifested by the bands. So say there are 12 time points. From the variance-covariance matrix and the fitted estimates, I can get the CI for the response at time 1 for the base group, then do the same for all the other times, and finally plot the CIs vertically at each time point.

Next, I do this for each group.

Is there a reason why CIs constructed manually like this would be off?

I think that with your method I could determine which of the spline coefficients are significant, and so get inference on whether there is any curvature at all. But would it tell me the direction of the curvature, that is, whether it is accelerating or not? So if I fit a concave cubic and it turns out to be statistically significant for, say, group 2 (via a group-by-time interaction), does that mean that group 2's estimated response levels are decelerating?

But if everything is linearized, then how would I visualize the trajectories from the model?
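On the direction-of-curvature question, one simple check: with a quadratic term in the model, the sign of its fitted coefficient, together with its t-statistic, indicates whether the trajectory is accelerating (positive) or decelerating (negative). A sketch in Python with simulated data for a single group (numbers invented):

```python
import numpy as np

rng = np.random.default_rng(4)

# 5 subjects in one group, 12 time points, concave true trajectory
t = np.tile(np.arange(1.0, 13.0), 5)
y = 40 + 3 * t - 0.15 * t**2 + rng.normal(0.0, 0.5, t.size)

X = np.column_stack([np.ones_like(t), t, t**2])
beta, res, *_ = np.linalg.lstsq(X, y, rcond=None)

# t-statistic for the quadratic term: its sign gives the direction of
# curvature (negative = decelerating/concave, positive = accelerating)
sigma2 = res[0] / (t.size - X.shape[1])
cov = sigma2 * np.linalg.inv(X.T @ X)
tstat = beta[2] / np.sqrt(cov[2, 2])
```

In a multi-group model the same logic applies to the group-by-time-squared interaction coefficient for each treatment group.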

6. Jul 16, 2017

Staff: Mentor

Yes, that would be my recommendation.

I am not an expert in pharmacokinetics, but I know that they have some fairly standard models that are used. Using one of those would give a good theoretical justification for a more complicated model.

7. Jul 16, 2017

FallenApple

Here is what I mean. This is an example with simulated data, building the confidence bands from scratch using the model estimates and the estimated variance-covariance matrix of the model.

Code (Text):

#fitting curve

x <- c(1:24)

mu <- 10 + 5 * sin( x * pi / 24 )  - 2 * cos( (x-6)*4/24 )

# errors

set.seed(2010)
eee <- rnorm( length(mu) )

# data

y <- mu + eee

x.squared <- x^2
x.cubed <- x^3

fit <- lm( y ~ x + x.squared + x.cubed )

print( summary( fit ) )

# Look at the fitted model:

fitted.mean <- predict( fit )
# Plot

#preliminary plot
plot(x,y)
lines( x, mu, col="red" )
lines( x, fitted.mean, col="blue", lwd=2 )
title("Data, true mean curve (red), and fitted (blue) using cubic")

#estimated variance covariance matrix of model
vcov(fit)

#vector of regression parameter estimates
B=as.numeric(fit$coefficients)

#function for getting the estimated response variable at any x
estimate=function(x){

return(B[1]+B[2]*x+B[3]*x^2+B[4]*x^3   )
}

v=NULL

#building vector of estimates to plot
for(i in 1:24){

v=c(v,estimate(i))

}

summary(fit)

n=24
p=4
qt(.975,n-p)

#function for getting the confidence interval for the response variable at any x
CI=function(x){

k=c(1,x,x^2,x^3)

#coerce the 1x1 matrices to scalars so the c(-1,1) arithmetic conforms
est=as.numeric(t(k)%*%B)
se=as.numeric(sqrt(t(k)%*%vcov(fit)%*%k))

return( est+qt(.975,n-p)*c(-1,1)*se )
}

CI(5)

conf=NULL

#building matrix of CIs to plot
for(i in 1:24){

conf=rbind(conf,CI(i))

}

conf[1,]

#plotting it all together.
plot(x,y)
lines( x, mu, col="red" )
lines(x,v,lwd=3)
lines( x, conf[,1], col="darkred", lwd=2 )
lines( x, conf[,2], col="darkred", lwd=2 )

title("Data, fitted using cubic")
legend("topleft",c("fitted mean curve","confidence band","true mean curve"), col=c("black","darkred","red"),
lty=1,cex=.5)

8. Jul 16, 2017

FallenApple

Ok, got it. Then from there, would I just tweak the model depending on details such as the residuals, as usual?

Also, is the reason people often suggest emphasizing theoretical considerations in model building, over, say, building the model from just what you see in a scatterplot or a plot of sample means, that the behavior of unseen data depends on theory and prior knowledge, so that how well the model extrapolates depends on this?

9. Jul 17, 2017

Staff: Mentor

It is more of a statistical concern. Suppose that you have data with a true linear relationship with some random noise. Now, if you sample four data points and do a linear fit you should get something fairly close to the line but with some residuals. However, if you fit a cubic polynomial then you will get a perfect fit with no residuals. This is called fitting to the noise.

Then, suppose you want to test how well the model fits the data. If you use the same data to test (validate) the model as you used to select the model then you will unavoidably find that it fits the data well, because the noise is not independent between model selection and validation.

So you can do a model selection from the data, but then you need to use a different set of data for the model validation. This is essentially the reason why "big data" requires large data sets.
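The fitting-to-the-noise effect described above is easy to demonstrate. A minimal sketch in Python with simulated data: four points from a truly linear relationship, where a cubic interpolates them exactly.

```python
import numpy as np

rng = np.random.default_rng(1)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1 + rng.normal(0.0, 0.3, 4)   # truly linear, plus noise

lin = np.polyfit(x, y, 1)   # 2 parameters: leaves genuine residuals
cub = np.polyfit(x, y, 3)   # 4 parameters for 4 points: a "perfect" fit

res_lin = y - np.polyval(lin, x)
res_cub = y - np.polyval(cub, x)

# the cubic has fit the noise: its residuals are numerically zero,
# which says nothing about how it will behave on new data
```

The zero residuals of the cubic are an artifact of having as many parameters as data points, not evidence that the cubic is the right model.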

10. Jul 25, 2017

FallenApple

That makes sense. So the noise not being independent means that I'm using the same noise as before.
And because I'm using the pattern in the data to generate the hypothesis, this becomes circular and may not generalize. So for another dataset, model selection may give a different result because the data could be different by chance.

So for big data, the issue of p-hacking is not there? That is, the dataset is so large that model selection is done on a set that might as well be considered the population of interest? Or is it that the validation set would be large enough that analysis could be conducted on it in its own right, which avoids the risk of resampling the same noise, as could happen in a smaller data set?

Last edited: Jul 25, 2017
11. Jul 25, 2017

Staff: Mentor

No, the dataset is so large that you can split it into separate subsets and use one subset for selecting the model and the other subset for validating it.
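A minimal sketch of that split (in Python, with simulated data and arbitrary candidate degrees): in-sample error always improves as the polynomial degree grows, so the model has to be judged on the held-out half.

```python
import numpy as np

rng = np.random.default_rng(3)

x = np.linspace(0.0, 1.0, 200)
y = 1 + 2 * x + rng.normal(0.0, 0.2, x.size)   # true relationship is linear

# random split into a model-selection half and a validation half
idx = rng.permutation(x.size)
train, test = idx[:100], idx[100:]

def mse(deg, fit_on, eval_on):
    coef = np.polyfit(x[fit_on], y[fit_on], deg)
    return np.mean((y[eval_on] - np.polyval(coef, x[eval_on])) ** 2)

# training error keeps shrinking as the polynomial degree grows ...
train_err = {d: mse(d, train, train) for d in (1, 3, 9)}
# ... but held-out error does not reward the extra flexibility
test_err = {d: mse(d, train, test) for d in (1, 3, 9)}
```

Only the held-out errors are a fair basis for choosing the degree, since the training errors are contaminated by the same noise the models were fitted to.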

12. Aug 1, 2017

FallenApple

Ok, got it. If the test set is too small, then we could end up with less power and mathematical issues such as having too many parameters relative to observations.

So proper scientific analysis using statistics requires not only testing hypotheses using a priori knowledge to construct the model, but also seeing whether the model is of any use. So a model that can explain should also be able to predict.

I want to see if my understanding of data analysis is correct. Let's say that there's plenty of data, that the model fitted on the training data shows no statistical problems (fits well, homogeneous variance checked, etc.), that scientific guidelines for model structure are followed, and finally that the model performs well on the test data, which ultimately implies that the model both explains well and predicts well.
Does that imply that the statistics is done and the only concerns left are scientific? For example, missing a potential confounder that the science has not yet considered?