Bootstrap in Monte Carlo and the number of samples

AI Thread Summary
The discussion centers on the challenges of using bootstrap methods to estimate the standard deviation from Monte Carlo simulation data. The original poster is experiencing minimal variation in bootstrap iterations, possibly due to a limited number of iterations (50) and the nature of their dataset, which consists of 5000 data points that do not significantly alter the histogram upon resampling. Participants emphasize the importance of understanding whether the data points are independent samples or if they exhibit correlation, as this affects the validity of bootstrap results. The conversation also highlights the need for clarity on the specific quantity being analyzed and the implications of using bootstrap in the context of correlated data. Overall, the effectiveness of bootstrap methods in this scenario remains uncertain due to the data's inherent characteristics.
diegzumillo
I am analyzing a lot of data from Monte Carlo simulations and trying to use bootstrap to estimate the standard deviation. What I'm finding, however, is very, very little variation with each iteration of bootstrap, and I don't know why.

Two reasons come to mind. Given the complexity of the analysis I cannot do more than 50 iterations, which sounds like too few according to sources out there, so maybe I just need more? Another thing is that I have 5000 data points, and resampling barely makes a dent in the histogram, so I'm not surprised it isn't changing the statistics much either.

Anyone with experience with bootstrap analysis have any idea?

PS: I have no idea what 'level' this question is.
 
diegzumillo said:
I am analyzing a lot of data from Monte Carlo simulations and trying to use bootstrap to estimate the standard deviation.

The standard deviation of what?

"Standard deviation" is associated with the distribution of a random variable. What random variable are you talking about?
 
Oh, I mean the standard deviation of the thing I measured after the whole analysis. I didn't go into detail because it's not a direct quantity that's easy to describe. Schematically it's something like: data > process data into a single quantity. Then bootstrap: resample the data > process into a single quantity. Then I take all the processed quantities from each bootstrap iteration and calculate the standard deviation of the whole set.
 
diegzumillo said:
Schematically it's something like: data > process data into a single quantity. Then bootstrap: resample the data > process into a single quantity. Then I take all the processed quantities from each bootstrap iteration and calculate the standard deviation of the whole set.

It's unclear what you mean and what you are doing.

If you have N independent samples of a random variable, there are estimators of its standard deviation (e.g. the sample standard deviation) that use all N samples. It isn't clear why you are bootstrapping. How does what you are doing differ from having N independent samples of the random variable of interest?
 
I can try explaining what I'm trying to do a little better. I have a set of data which can be used to calculate a quantity I'm interested in. For the sake of example, say we want to calculate the skewness. I take the original data and calculate the skewness. But the data I have is itself a small sample of the entire sample space, and since I can't run simulations forever, the small sample I have will have to do. So I'll use bootstrap and resample it again and again. Each time I resample I calculate the skewness, which will be slightly different each time. Then I calculate the standard deviation of all the skewnesses (this word might not exist) obtained across the iterations, as a way of saying how confident I am that the skewness I calculated originally is representative of the larger population (the one I don't have).
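The procedure described above can be sketched in a few lines of Python. This is a minimal, hypothetical stand-in for the actual analysis: the `skewness` function and the generated `data` are placeholders for the real processed quantity, and `n_boot=200` is just an illustrative choice.

```python
import math
import random
import statistics

def skewness(xs):
    """Sample skewness: third central moment over the cubed std. dev."""
    n = len(xs)
    m = sum(xs) / n
    s = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (n * s ** 3)

def bootstrap_se(data, stat, n_boot=200, seed=0):
    """Bootstrap standard error of `stat`: resample the data with
    replacement, recompute the statistic, take the spread of the copies."""
    rng = random.Random(seed)
    n = len(data)
    reps = []
    for _ in range(n_boot):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        reps.append(stat(resample))
    return statistics.stdev(reps)

# Placeholder data: a skewed sample standing in for the simulation output.
rng = random.Random(1)
data = [rng.random() ** 2 for _ in range(500)]

se_skew = bootstrap_se(data, skewness)

# Sanity check: for the plain mean, the bootstrap SE should reproduce the
# textbook formula s / sqrt(n).
se_mean = bootstrap_se(data, statistics.mean)
closed_form = statistics.stdev(data) / math.sqrt(len(data))
```

Comparing `se_mean` against `closed_form` is a useful first debugging step: if the bootstrap machinery can't reproduce the known standard error of the mean, it won't be trustworthy for a more complicated statistic either.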

Sorry for being vague earlier. The thing is, I don't know much about this stuff, and whenever I don't know much about something I tend to assume I'm the only dummy who doesn't, so everyone else would recognize the problem without much explanation. I.e., I was lazy and presumptuous.
 
diegzumillo said:
I have a set of data which can be used to calculate a quantity I'm interested in. For the sake of example, say we want to calculate the skewness.

It's important to know whether you want to estimate a property of a random variable versus a property of N samples of that random variable. For example, the standard deviation of a random variable is a different number than the standard deviation of the mean of 20 independent samples of that random variable. It isn't clear whether you are trying to estimate a parameter of a finite set of outcomes of a random variable or whether you are trying to estimate a parameter associated with the distribution of a single outcome of that random variable.

It isn't clear whether your 5000 data points are independent samples of the same random variable or whether they are generated by a process that introduces dependence in their values - such as a Markov chain or an ARIMA process.
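The first distinction above can be seen in a tiny numeric sketch (the numbers here are hypothetical, chosen only to make the factor of sqrt(20) visible):

```python
import random
import statistics

rng = random.Random(0)

# Standard deviation of the random variable itself, estimated from many draws.
draws = [rng.gauss(10.0, 2.0) for _ in range(100000)]
sd_x = statistics.stdev(draws)          # close to 2.0

# Standard deviation of the mean of 20 independent samples:
# a different, smaller number, reduced by a factor of sqrt(20).
means = [statistics.mean(rng.gauss(10.0, 2.0) for _ in range(20))
         for _ in range(5000)]
sd_mean20 = statistics.stdev(means)     # close to 2.0 / sqrt(20)
```

The two quantities answer different questions, so it matters which one the bootstrap is supposed to estimate.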
 
1) Are you changing the random number seed from one run to the next?
2) Is the situation being simulated such that the random part is small compared to the whole?
3) Is the significant random part a rare event?
 
diegzumillo said:
I am analyzing a lot of data from Monte Carlo simulations and trying to use bootstrap to estimate the standard deviation. What I'm finding, however, is very, very little variation with each iteration of bootstrap, and I don't know why.

Two reasons come to mind. Given the complexity of the analysis I cannot do more than 50 iterations, which sounds like too few according to sources out there, so maybe I just need more? Another thing is that I have 5000 data points, and resampling barely makes a dent in the histogram, so I'm not surprised it isn't changing the statistics much either.

Anyone with experience with bootstrap analysis have any idea?

PS: I have no idea what 'level' this question is.

What is the meaning of the "5000" and of the "50" (as in 50 iterations)? Are you ultimately getting 50 samples of your quantity of interest, or are you getting 5000?

A sample of size 50 is somewhat "small", but people often need to deal with samples that small in applications. If the data are roughly normally distributed you can get confidence intervals on the variance by using the chi-squared distribution.

A sample of size 5000 is really quite good, relative to what people often need to deal with in applications. An inference based on a sample of that size ought to be more "meaningful" than one based on a sample of size 50.
 
Oh shoot. I thought this conversation had died after my last comment because I didn't get any notification (probably overlooked it).

This is still a problem for me, by the way. Bootstrapping still gives unrealistically small error bars.

The problem I'm working on is a Monte Carlo simulation on a lattice (think Ising model), where for each temperature I calculate observables like magnetization for about 5000 different configurations. Then I calculate a density of states using the Ferrenberg-Swendsen algorithm.

I'm not very confident about bootstrapping this system because the data is correlated, as is usually the case with Monte Carlo methods. The Ferrenberg-Swendsen algorithm takes autocorrelation into account, so that's fine, but then what about bootstrapping? Shouldn't the data be uncorrelated?
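For correlated time series, one standard remedy is to resample contiguous blocks of configurations rather than individual ones, so that correlations shorter than the block length survive the resampling. A minimal sketch of a moving-block bootstrap follows; the AR(1) demo series is a hypothetical stand-in for a Monte Carlo history of an observable, and the block length should in practice be several autocorrelation times.

```python
import random
import statistics

def block_bootstrap_se(series, stat, block_len, n_boot=200, seed=0):
    """Moving-block bootstrap: build each resample from randomly chosen
    contiguous blocks, preserving autocorrelation within each block."""
    rng = random.Random(seed)
    n = len(series)
    n_blocks = -(-n // block_len)        # ceil(n / block_len)
    starts = range(n - block_len + 1)    # allowed block start indices
    reps = []
    for _ in range(n_boot):
        sample = []
        for _ in range(n_blocks):
            s = rng.choice(starts)
            sample.extend(series[s:s + block_len])
        reps.append(stat(sample[:n]))    # trim to the original length
    return statistics.stdev(reps)

# Demo: a strongly autocorrelated AR(1) series, standing in for a
# Monte Carlo history of some observable.
rng = random.Random(2)
x, series = 0.0, []
for _ in range(2000):
    x = 0.9 * x + rng.gauss(0.0, 1.0)
    series.append(x)

naive_se = block_bootstrap_se(series, statistics.mean, block_len=1)
block_se = block_bootstrap_se(series, statistics.mean, block_len=50)
```

With `block_len=1` this reduces to the ordinary bootstrap, which for this series underestimates the error of the mean by a large factor - exactly the "unrealistically small error bars" symptom. Binning the data before an ordinary bootstrap, or thinning to roughly independent configurations, achieves a similar effect.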
 