Bootstrap in Monte Carlo and the number of samples

  • Context: Undergrad
  • Thread starter: diegzumillo
  • Tags: Bootstrap, Monte Carlo

Discussion Overview

The discussion revolves around the use of bootstrap methods to estimate the standard deviation from data generated by Monte Carlo simulations. Participants explore the challenges of applying bootstrap techniques, particularly in the context of correlated data and the implications of sample size on the variability of results.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant notes a lack of variation in bootstrap iterations, suggesting that the limited number of iterations (50) and the large dataset (5000 points) may be contributing factors.
  • Another participant questions the specific random variable for which the standard deviation is being estimated, seeking clarification on the nature of the data and the analysis process.
  • A participant explains their approach of calculating skewness from the data and using bootstrap to assess the variability of this estimate, expressing uncertainty about their understanding of the methodology.
  • Concerns are raised about whether the data points are independent samples or if they are correlated, which could affect the validity of bootstrap results.
  • Questions are posed regarding the randomness of the simulations, including whether the random number seed is changed and if the random component is significant compared to the overall simulation.
  • One participant emphasizes the distinction between estimating properties of a random variable versus properties of a finite sample, highlighting the importance of understanding the underlying data structure.
  • A later reply discusses the implications of sample size, noting that while a sample of 5000 is generally robust, a sample of 50 may be considered small in statistical applications.
  • Another participant expresses ongoing concerns about the small error bars produced by bootstrapping in the context of correlated data from Monte Carlo simulations, specifically referencing the Ferrenberg-Swendsen algorithm.

Areas of Agreement / Disagreement

Participants express differing views on the appropriateness of bootstrap methods for correlated data and the implications of sample size on the results. There is no consensus on the best approach to take in this context, and the discussion remains unresolved.

Contextual Notes

Participants highlight potential limitations related to the independence of data points and the assumptions underlying bootstrap methods. The discussion also touches on the complexity of the analysis and the challenges of interpreting results from correlated data.

diegzumillo
I am analyzing a lot of data from Monte Carlo simulations and trying to use bootstrap to estimate the standard deviation. What I'm finding, however, is very, very little variation with each iteration of the bootstrap, and I don't know why.

Two reasons come to mind. Given the complexity of the analysis, I cannot do more than 50 iterations, which sounds like too few according to the sources I've seen, so maybe I just need more? The other thing is that I have 5000 data points, so resampling barely makes a dent in the histogram, and I'm not surprised it isn't changing the statistics much either.

Anyone with experience with bootstrap analysis have any idea?

PS: I have no idea what 'level' this question is.
 
diegzumillo said:
I am analyzing a lot of data from Monte Carlo simulations and trying to use bootstrap to estimate the standard deviation.

The standard deviation of what?

"Standard deviation" is associated with the distribution of a random variable. What random variable are you talking about?
 
Oh, I mean the standard deviation of the thing I measured after the whole analysis. I didn't go into detail because it's not a direct quantity that's easy to describe. Schematically it's something like: data → processed into a single quantity. Then, for each bootstrap resample of the data → processed into a single quantity. Then I take all the processed quantities from the bootstrap iterations and calculate the standard deviation of that whole set.
 
diegzumillo said:
Schematically it's something like: data → processed into a single quantity. Then, for each bootstrap resample of the data → processed into a single quantity. Then I take all the processed quantities from the bootstrap iterations and calculate the standard deviation of that whole set.

It's unclear what you mean and what you are doing.

If you have N independent samples of a random variable, there are estimators of its standard deviation (e.g. the sample standard deviation) that use all N samples, so it isn't clear why you are bootstrapping. How does what you are doing differ from having N independent samples of the random variable of interest?
 
I can try explaining what I'm trying to do a little better. I have a set of data which can be used to calculate a quantity I'm interested in; for the sake of example, say we want to calculate the skewness. I take the original data and calculate the skewness. But the data I have is itself a small sample of the entire sample space, and since I can't run simulations forever, the small sample I have will have to do. I'll use the bootstrap: resample it again and again, and each time I resample, calculate the skewness, which, perhaps surprisingly, will be different each time. Then I calculate the standard deviation of all the skewnesses (that word might not exist) obtained across the iterations, as a way of saying how confident I am that the skewness I calculated originally is representative of the larger sample space (the one I don't have).
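A minimal sketch of the resample-and-recompute procedure described above, assuming the data points are independent (the caveat raised later in the thread). All names (`skewness`, `bootstrap_se`, `n_boot`) are illustrative, and stand-in normal data replaces the actual simulation output:

```python
import random
import statistics


def skewness(xs):
    """Sample skewness: third central moment over the cubed std dev."""
    m = statistics.mean(xs)
    s = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)


def bootstrap_se(data, estimator, n_boot=200, seed=0):
    """Std dev of the estimator over n_boot resamples with replacement."""
    rng = random.Random(seed)
    estimates = [
        estimator([rng.choice(data) for _ in range(len(data))])
        for _ in range(n_boot)
    ]
    return statistics.stdev(estimates)


# Stand-in data: 5000 i.i.d. normal draws (the thread's sample size).
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(5000)]
point_estimate = skewness(data)     # skewness of the original sample
err = bootstrap_se(data, skewness)  # bootstrap error bar on it
```

With 5000 i.i.d. points the bootstrap spread of the skewness is genuinely small (of order 1/√n), which is consistent with the "very little variation" observed in the opening post.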

Sorry for being vague earlier. The thing is I don't know much about this stuff, and whenever I don't know much about something I tend to assume I'm the only dummy who doesn't know about it, therefore everyone else would recognize the problem without a lot of explanation. i.e. I was lazy and presumptuous.
 
diegzumillo said:
I have a set of data which can be used to calculate a quantity I'm interested in. For the sake of example, say we want to calculate the skewness.

It's important to know whether you want to estimate a property of a random variable versus a property of N samples of that random variable. For example, the standard deviation of a random variable is a different number than the standard deviation of the mean of 20 independent samples of that random variable. It isn't clear whether you are trying to estimate a parameter of a finite set of outcomes of a random variable or a parameter associated with the distribution of a single outcome of that random variable.

It also isn't clear whether your 5000 data points are independent samples of the same random variable or whether they are generated by a process that introduces dependence in their values, such as a Markov chain or an ARIMA process.
 
1) Are you changing the random number seed from one run to the next?
2) Is the situation being simulated such that the random part is small compared to the whole?
3) Is the significant random part a rare event?
 
diegzumillo said:
I am analyzing a lot of data from Monte Carlo simulations and trying to use bootstrap to estimate the standard deviation. What I'm finding, however, is very, very little variation with each iteration of the bootstrap, and I don't know why.

Two reasons come to mind. Given the complexity of the analysis, I cannot do more than 50 iterations, which sounds like too few according to the sources I've seen, so maybe I just need more? The other thing is that I have 5000 data points, so resampling barely makes a dent in the histogram, and I'm not surprised it isn't changing the statistics much either.

Anyone with experience with bootstrap analysis have any idea?

PS: I have no idea what 'level' this question is.

What is the meaning of the "5000" and of the "50" (as in 50 iterations)? Are you ultimately getting 50 samples of your quantity of interest, or are you getting 5000?

A sample of size 50 is somewhat "small", but people often need to deal with samples that small in applications. If the data are roughly normally distributed, you can get confidence intervals on the variance by using the chi-squared distribution.

A sample of size 5000 is really quite good, relative to what people often need to deal with in applications. An inference based on a sample of that size ought to be more "meaningful" than one based on a sample of size 50.
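A quick illustration of the sample-size point above, assuming i.i.d. normal data and using the mean as the quantity of interest (an assumption for the sketch, not the thread's actual observable): the spread of bootstrap estimates is governed by the underlying sample size n, not by how many bootstrap iterations are run.

```python
import random
import statistics


def bootstrap_spread(n, n_boot=200, seed=42):
    """Std dev of bootstrapped sample means for a sample of size n."""
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]
    means = [
        statistics.mean([rng.choice(data) for _ in range(n)])
        for _ in range(n_boot)
    ]
    return statistics.stdev(means)


se_small = bootstrap_spread(50)    # roughly 1/sqrt(50),   ~0.14
se_large = bootstrap_spread(5000)  # roughly 1/sqrt(5000), ~0.014
```

The n = 5000 case scatters about ten times less than the n = 50 case, matching the √n scaling described above.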
 
Oh shoot. I thought this conversation had died after my last comment because I didn't get any notification (I probably overlooked it).

This is still a problem for me, by the way. Bootstrapping still gives unrealistically small error bars.

The problem I'm working on is a Monte Carlo simulation on a lattice (think Ising model), where for each temperature I calculate observables like the magnetization for about 5000 different configurations. Then I calculate a density of states using the Ferrenberg-Swendsen algorithm.

I'm not very confident about bootstrapping this system because the data is correlated, as is usually the case in Monte Carlo methods. The Ferrenberg-Swendsen algorithm takes autocorrelation into account, so that's fine, but then bootstrapping? Shouldn't the data be uncorrelated?
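The concern in the last post is well founded: the ordinary bootstrap assumes independent points, and resampling an autocorrelated Monte Carlo chain point-by-point produces error bars that are too small. A standard remedy from the resampling literature (not something proposed in this thread) is the moving-block bootstrap: resample contiguous blocks longer than the autocorrelation time instead of single points. A sketch, using an AR(1) series as a stand-in for correlated Monte Carlo output (all names illustrative):

```python
import random
import statistics


def moving_block_bootstrap(data, block_len, estimator, n_boot=200, seed=0):
    """Bootstrap SE from resampling contiguous blocks of length block_len."""
    rng = random.Random(seed)
    n = len(data)
    n_blocks = n // block_len
    starts = range(n - block_len + 1)  # every allowed block start
    estimates = []
    for _ in range(n_boot):
        sample = []
        for _ in range(n_blocks):
            s = rng.choice(starts)
            sample.extend(data[s:s + block_len])
        estimates.append(estimator(sample))
    return statistics.stdev(estimates)


# AR(1)-style correlated series as a stand-in for Monte Carlo output.
rng = random.Random(7)
x, series = 0.0, []
for _ in range(5000):
    x = 0.9 * x + rng.gauss(0.0, 1.0)
    series.append(x)

# block_len=1 is the naive point-by-point bootstrap, which ignores the
# correlation and therefore underestimates the error bar on the mean.
naive = moving_block_bootstrap(series, 1, statistics.mean)
blocked = moving_block_bootstrap(series, 100, statistics.mean)
```

For this strongly correlated series the blocked error bar comes out several times larger than the naive one, which is the direction of the discrepancy the poster is worried about. In practice the block length is chosen to exceed the measured integrated autocorrelation time of the chain.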
 
