Bootstrap in Monte Carlo and the number of samples

  • Context: Undergrad
  • Thread starter: diegzumillo
  • Tags: Bootstrap, Monte Carlo

Discussion Overview

The discussion revolves around the use of bootstrap methods to estimate the standard deviation from data generated by Monte Carlo simulations. Participants explore the challenges of applying bootstrap techniques, particularly in the context of correlated data and the implications of sample size on the variability of results.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant notes a lack of variation in bootstrap iterations, suggesting that the limited number of iterations (50) and the large dataset (5000 points) may be contributing factors.
  • Another participant questions the specific random variable for which the standard deviation is being estimated, seeking clarification on the nature of the data and the analysis process.
  • A participant explains their approach of calculating skewness from the data and using bootstrap to assess the variability of this estimate, expressing uncertainty about their understanding of the methodology.
  • Concerns are raised about whether the data points are independent samples or if they are correlated, which could affect the validity of bootstrap results.
  • Questions are posed regarding the randomness of the simulations, including whether the random number seed is changed and if the random component is significant compared to the overall simulation.
  • One participant emphasizes the distinction between estimating properties of a random variable versus properties of a finite sample, highlighting the importance of understanding the underlying data structure.
  • A later reply discusses the implications of sample size, noting that while a sample of 5000 is generally robust, a sample of 50 may be considered small in statistical applications.
  • Another participant expresses ongoing concerns about the small error bars produced by bootstrapping in the context of correlated data from Monte Carlo simulations, specifically referencing the Ferrenberg-Swendsen algorithm.

Areas of Agreement / Disagreement

Participants express differing views on the appropriateness of bootstrap methods for correlated data and the implications of sample size on the results. There is no consensus on the best approach to take in this context, and the discussion remains unresolved.

Contextual Notes

Participants highlight potential limitations related to the independence of data points and the assumptions underlying bootstrap methods. The discussion also touches on the complexity of the analysis and the challenges of interpreting results from correlated data.

diegzumillo
I am analyzing a lot of data from Monte Carlo simulations and trying to use bootstrap to estimate the standard deviation. What I'm finding, however, is very, very little variation with each iteration of the bootstrap, and I don't know why.

Two reasons come to mind. Given the complexity of the analysis, I cannot do more than 50 iterations, which sounds like too few according to the sources I've seen, so maybe I just need more? The other thing is that I have 5000 data points, so resampling barely makes a dent in the histogram, and I'm not surprised it isn't changing the statistics much either.

Anyone with experience with bootstrap analysis have any idea?

PS: I have no idea what 'level' this question is.
 
diegzumillo said:
I am analyzing a lot of data from Monte Carlo simulations and trying to use bootstrap to estimate the standard deviation.

The standard deviation of what?

"Standard deviation" is associated with the distribution of a random variable. What random variable are you talking about?
 
Oh, I mean the standard deviation of the thing I measured after the whole analysis. I didn't go into detail because it's not a direct quantity that's easy to describe. Schematically it's something like: data → processed into a single quantity. Then, for each bootstrap resample of the data → processed into a single quantity. Then I take all the processed quantities from the bootstrap iterations and calculate the standard deviation of that whole set.
 
diegzumillo said:
Schematically it's something like: data → processed into a single quantity. Then, for each bootstrap resample of the data → processed into a single quantity. Then I take all the processed quantities from the bootstrap iterations and calculate the standard deviation of that whole set.

It's unclear what you mean and what you are doing.

If you have N independent samples of a random variable, there are estimators of its standard deviation (e.g. the sample standard deviation) that use all N samples, so it isn't clear why you are bootstrapping. How does what you are doing differ from having N independent samples of the random variable of interest?
 
I can try explaining what I'm trying to do a little better. I have a set of data which can be used to calculate a quantity I'm interested in; for the sake of example, say we want to calculate the skewness. I take the original data and calculate the skewness. But the data I have is itself a small sample of the entire sample space, and since I can't run simulations forever, the small sample I have will have to do. I'll use the bootstrap: resample it again and again, and each time I resample, calculate the skewness, which, perhaps surprisingly, will be different each time. Then I calculate the standard deviation of all the skewnesses (that word might not exist) obtained across the iterations, as a way of saying how confident I am that the skewness I calculated originally is representative of the larger sample space (the one I don't have).
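A minimal sketch of the resample-and-recompute procedure described above, assuming the data points are independent (the caveat raised later in the thread). All names (`skewness`, `bootstrap_se`, `n_boot`) are illustrative, and stand-in normal data replaces the actual simulation output:

```python
import random
import statistics


def skewness(xs):
    """Sample skewness: third central moment over the cubed std dev."""
    m = statistics.mean(xs)
    s = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)


def bootstrap_se(data, estimator, n_boot=200, seed=0):
    """Std dev of the estimator over n_boot resamples with replacement."""
    rng = random.Random(seed)
    estimates = [
        estimator([rng.choice(data) for _ in range(len(data))])
        for _ in range(n_boot)
    ]
    return statistics.stdev(estimates)


# Stand-in data: 5000 i.i.d. normal draws (the thread's sample size).
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(5000)]
point_estimate = skewness(data)     # skewness of the original sample
err = bootstrap_se(data, skewness)  # bootstrap error bar on it
```

With 5000 i.i.d. points the bootstrap spread of the skewness is genuinely small (of order 1/√n), which is consistent with the "very little variation" observed in the opening post.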

Sorry for being vague earlier. The thing is I don't know much about this stuff, and whenever I don't know much about something I tend to assume I'm the only dummy who doesn't know about it, therefore everyone else would recognize the problem without a lot of explanation. i.e. I was lazy and presumptuous.
 
diegzumillo said:
I have a set of data which can be used to calculate a quantity I'm interested in. For the sake of example, say we want to calculate the skewness.

It's important to know whether you want to estimate a property of a random variable versus a property of N samples of that random variable. For example, the standard deviation of a random variable is a different number than the standard deviation of the mean of 20 independent samples of that random variable. It isn't clear whether you are trying to estimate a parameter of a finite set of outcomes of a random variable or a parameter associated with the distribution of a single outcome of that random variable.

It also isn't clear whether your 5000 data points are independent samples of the same random variable or whether they are generated by a process that introduces dependence in their values, such as a Markov chain or an ARIMA process.
 
1) Are you changing the random number seed from one run to the next?
2) Is the situation being simulated such that the random part is small compared to the whole?
3) Is the significant random part a rare event?
 
diegzumillo said:
I am analyzing a lot of data from Monte Carlo simulations and trying to use bootstrap to estimate the standard deviation. What I'm finding, however, is very, very little variation with each iteration of the bootstrap, and I don't know why.

Two reasons come to mind. Given the complexity of the analysis, I cannot do more than 50 iterations, which sounds like too few according to the sources I've seen, so maybe I just need more? The other thing is that I have 5000 data points, so resampling barely makes a dent in the histogram, and I'm not surprised it isn't changing the statistics much either.

Anyone with experience with bootstrap analysis have any idea?

PS: I have no idea what 'level' this question is.

What is the meaning of the "5000" and of the "50" (as in 50 iterations)? Are you ultimately getting 50 samples of your quantity of interest, or are you getting 5000?

A sample of size 50 is somewhat "small", but people often need to deal with samples that small in applications. If the data are roughly normally distributed, you can get confidence intervals on the variance by using the chi-squared distribution.

A sample of size 5000 is really quite good, relative to what people often need to deal with in applications. An inference based on a sample of that size ought to be more "meaningful" than one based on a sample of size 50.
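A quick illustration of the sample-size point above, assuming i.i.d. normal data and using the mean as the quantity of interest (an assumption for the sketch, not the thread's actual observable): the spread of bootstrap estimates is governed by the underlying sample size n, not by how many bootstrap iterations are run.

```python
import random
import statistics


def bootstrap_spread(n, n_boot=200, seed=42):
    """Std dev of bootstrapped sample means for a sample of size n."""
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]
    means = [
        statistics.mean([rng.choice(data) for _ in range(n)])
        for _ in range(n_boot)
    ]
    return statistics.stdev(means)


se_small = bootstrap_spread(50)    # roughly 1/sqrt(50),   ~0.14
se_large = bootstrap_spread(5000)  # roughly 1/sqrt(5000), ~0.014
```

The n = 5000 case scatters about ten times less than the n = 50 case, matching the √n scaling described above.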
 
Oh shoot. I thought this conversation had died after my last comment because I didn't get any notification (I probably overlooked it).

This is still a problem for me, by the way. Bootstrapping still gives unrealistically small error bars.

The problem I'm working on is a Monte Carlo simulation on a lattice (think Ising model), where for each temperature I calculate observables like the magnetization for about 5000 different configurations. Then I calculate a density of states using the Ferrenberg-Swendsen algorithm.

I'm not very confident about bootstrapping this system because the data is correlated, as is usually the case in Monte Carlo methods. The Ferrenberg-Swendsen algorithm takes autocorrelation into account, so that's fine, but then bootstrapping? Shouldn't the data be uncorrelated?
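The concern in the last post is well founded: the ordinary bootstrap assumes independent points, and resampling an autocorrelated Monte Carlo chain point-by-point produces error bars that are too small. A standard remedy from the resampling literature (not something proposed in this thread) is the moving-block bootstrap: resample contiguous blocks longer than the autocorrelation time instead of single points. A sketch, using an AR(1) series as a stand-in for correlated Monte Carlo output (all names illustrative):

```python
import random
import statistics


def moving_block_bootstrap(data, block_len, estimator, n_boot=200, seed=0):
    """Bootstrap SE from resampling contiguous blocks of length block_len."""
    rng = random.Random(seed)
    n = len(data)
    n_blocks = n // block_len
    starts = range(n - block_len + 1)  # every allowed block start
    estimates = []
    for _ in range(n_boot):
        sample = []
        for _ in range(n_blocks):
            s = rng.choice(starts)
            sample.extend(data[s:s + block_len])
        estimates.append(estimator(sample))
    return statistics.stdev(estimates)


# AR(1)-style correlated series as a stand-in for Monte Carlo output.
rng = random.Random(7)
x, series = 0.0, []
for _ in range(5000):
    x = 0.9 * x + rng.gauss(0.0, 1.0)
    series.append(x)

# block_len=1 is the naive point-by-point bootstrap, which ignores the
# correlation and therefore underestimates the error bar on the mean.
naive = moving_block_bootstrap(series, 1, statistics.mean)
blocked = moving_block_bootstrap(series, 100, statistics.mean)
```

For this strongly correlated series the blocked error bar comes out several times larger than the naive one, which is the direction of the discrepancy the poster is worried about. In practice the block length is chosen to exceed the measured integrated autocorrelation time of the chain.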
 
