Variably Spaced Data: Generating Integer Series with Minimal Spacing

coolnessitself · Dec 18, 2009

Hi all,

I'm looking to generate a series of integers in the range [1,1000], that start off with a large spacing, like 1 50 98, etc. At a peak of some distribution, they should be spaced with some minimal \Delta spacing. For example if the peak is at 500 and \Delta=2, you might see ... 495, 498, 500, 502, 504, ... Then at the other end of the [1,1000] window, it's back to big spacing 950 999. I'd like the number of integers found in a sliding window of some size in this interval to follow some bell-shaped distribution, but with a constraint on the minimum spacing between numbers.

I get the feeling this is somehow related to a poisson process, so that's why I'm posting here, but that's about it.

Suggestions to bump me in the right direction?

SW VandeCarr · Dec 18, 2009

If you have access to Minitab (or some other similar package with random number generators) you can specify the parameters for a random normal distribution including kurtosis and a minimal interval between numbers.

coolnessitself · Dec 18, 2009

But can you make that interval follow some distribution, but with a minimum interval? I have matlab, and I could easily obtain R, but I'm looking for some math behind it, really. Maybe this is more of an algorithmic thing than a statistics thing...

SW VandeCarr · Dec 18, 2009

coolnessitself said:

But can you make that interval follow some distribution, but with a minimum interval? I have matlab, and I could easily obtain R, but I'm looking for some math behind it, really. Maybe this is more of an algorithmic thing than a statistics thing...

I haven't used Minitab for a while, but I believe you can write a short program (algorithm) such that x_{i+1}-x_{i} \geq 2 before invoking the random normal generator.

EDIT: If this can't be done, the OP can run this algorithm over the generated sequence to purge it of consecutive integers and interpolate if necessary. The OP did not specify a particular distribution, only a result.

bpet · Dec 18, 2009

coolnessitself said:

Hi all,

I'm looking to generate a series of integers in the range [1,1000], that start off with a large spacing, like 1 50 98, etc. At a peak of some distribution, they should be spaced with some minimal \Delta spacing. For example if the peak is at 500 and \Delta=2, you might see ... 495, 498, 500, 502, 504, ... Then at the other end of the [1,1000] window, it's back to big spacing 950 999. I'd like the number of integers found in a sliding window of some size in this interval to follow some bell-shaped distribution, but with a constraint on the minimum spacing between numbers.

I get the feeling this is somehow related to a poisson process, so that's why I'm posting here, but that's about it.

Suggestions to bump me in the right direction?

One way is to set x_{i+1}=x_i+\left\lceil c/f(x_i)\right\rceil where f(x) is some density function - say it's normal, so choose mu=500 then choose c so that the gap is 2 in the middle then choose sigma to specify the gaps at x=1. If you want the gaps to be more random then simply add or multiply by a random number from some appropriate distribution.

HTH

SW VandeCarr · Dec 18, 2009

bpet said:

One way is to set x_{i+1}=x_i+\left\lceil c/f(x_i)\right\rceil where f(x) is some density function - say it's normal, so choose mu=500 then choose c so that the gap is 2 in the middle then choose sigma to specify the gaps at x=1.

The OP is apparently thinking of a discrete distribution (only integers in this case) so f(x) is really a normal approximation. I don't understand how one chooses "c" to assure that the result is a non-consecutive integer.

bpet · Dec 18, 2009

SW VandeCarr said:

I don't understand how one chooses "c" to assure that the result is a non-consecutive integer.

no it doesn't but the ceiling function \left\lceil\cdot\right\rceil does :)

coolnessitself · Dec 18, 2009

cei-ing the c/f(x) does an adequate job for my purposes, so I'm set for now, but I guess I'm still curious about this problem in general.
A motivation: Given some bus station has buses arriving throughout the day (or here, seconds 1 to 1000). About noon (500), there are a lot of buses, whereas at midnight they're few and far between. I'm trying to generate these times, but the bus station has a capacity, so I have to have a minimum spacing between arrival times in my generated numbers.
So the ceil(c/f(x)) works for now. It's just tricky finding a sigma that works for some input window, peak, and capacity without trial and error...

As an alternative, would something like this work:
http://www.mathpages.com/home/kmath452.htm
with x_2 >x_1 + \delta, then control \delta with some distribution?

bpet · Dec 18, 2009

coolnessitself said:

A motivation: Given some bus station has buses arriving throughout the day (or here, seconds 1 to 1000). About noon (500), there are a lot of buses, whereas at midnight they're few and far between. I'm trying to generate these times, but the bus station has a capacity, so I have to have a minimum spacing between arrival times in my generated numbers.

At some places I've lived at the buses tended to arrive in groups - the first one crammed full and some empty ones stuck behind (not that that helps the model).

You might want to take other peak times into account, such as 8am & 6pm, so maybe a multipeaked or piecewise constant or spline shape might be more appropriate as the gap function - then instead of a nonlinear inverse density problem you've just got to choose some interpolation coefficients to match the shape of the desired gap rate function.

coolnessitself · Dec 19, 2009

I haven't gotten the x_2=x_1+\delta to work yet, but I like your c/f(x) suggestion since it can work for multiple peaks, as you say. I can get 2 equally-weighted peaks to work fairly well. I know this isn't a coding forum, but maybe this MATLAB can help someone

Code:

clc
maxVal=1000; %interval from 1 to 1000
minSpacing=2; %min spacing 2
maxSpacing=30; %max spacing 30
peakvec=[200 800]; %location of double peaks
weights=[1 1];

%build up a function to minimize
s='';
for i=1:length(peakvec)
   s=[s '+' num2str(weights(i)) '*normpdf([' num2str(1:maxVal) '],' num2str(peakvec(i)) ',sigmavec(' num2str(i) '))'];
end
s(1)=[];
func=inline(s,'sigmavec');
pdf = @(sigma) func(sigma)/max(func(sigma));
minpdf = @(sigma) abs(max(ceil(minSpacing./pdf(sigma))) - maxSpacing);
%minimizing minpdf should result in a pair of sigmas that keep maxSpaxing

sigma_0 = repmat(20,size(peakvec));
warning('off','MATLAB:divideByZero');
sigmavec = fminsearch(minpdf,sigma_0); %,repmat(0,size(peakvec)),[]) %the second part if you have fminsearchbnd
warning('on','MATLAB:divideByZero');
disp(['"Best" sigma chosen as ' num2str(sigmavec)]);

%build up a new pdf to generate x(i+1) from x(i)
s='';
for i=1:length(peakvec)
   s=[s '+' num2str(weights(i)) '*normpdf(x,' num2str(peakvec(i)) ',' num2str(sigmavec(i)) ')'];
end
s(1)=[];

func=inline(s,'x');
pdf = @(x) func(x)./max(func([1:maxVal]));

%plot what it looks like
figure
subplot(2,1,1);
plot([1:maxVal],pdf(1:maxVal));
title('distribution of buses');

%find the actual maxSpacing,minSpacing
maxseen=0; minseen=inf;
maxseenat=[]; minseenat=[];
x(1)=1;
i=1;
while x(i)<=maxVal
    x(i+1)=x(i)+ceil(minSpacing/pdf(x(i)));
    if x(i+1)-x(i)>maxseen
        maxseen=x(i+1)-x(i);
        maxseenat=[maxseenat; i];
    elseif x(i+1)-x(i) < minseen
        minseen =x(i+1)-x(i);
        minseenat=[minseenat; i];
    end
    i=i+1;
end
x(x>maxVal)=[];

disp(['Maximum spacing seen: ' num2str(maxseen) ' at position ' num2str(maxseenat')]);
disp(['Minimum spacing seen: ' num2str(minseen) ' at position ' num2str(minseenat')]);
subplot(2,1,2);
plot(x,1);
title('bus times')

Now back to stats. Add a third peak in the above peakvec and the distribution of bus times isn't even. Say I want n peaks, weighted with w_1, w_2, \ldots w_n. What would my f(x) look like? If I just sum them as w_1N(\mu_1,\sigma_1) + ... + w_nN(\mu_n,\sigma_n), my least squares business above doesn't work. Peaks are arbitrarily scaled as far as I can tell. Do I have to deal with covariance matrices somehow to make the amount of buses arriving in a peak time weighted as I wish? Also, this is insufficient for large intervals (like 60*60*24 seconds instead of 1000)

ps. to be honest this has nothing to do with bus times but it's a fun analogy :]

bpet · Dec 20, 2009

You might like to try the Shape Langauge Modeling download on Matlab Central, which might give you a simpler way of fitting the gap function. In particular there's no need to restrict yourself to inverted probability densities. Also avoid using strings/inlines to define functions - in your case it should be possible with something like @(x)sum(arrayfun(...)).

HTH

coolnessitself · Dec 22, 2009

I think I would need something like cellfun, but my version of MATLAB doesn't allow anonymous functions in cellfun, just built-in functions, so that won't work. But anyways, that's all fine and good.

Followup question:
Is there a term or subject or something I can look into to describe the distribution of an iterative sequence like the one you offered?
 x_{i+1}=x_i+\left\lceil c/f(x_i)\right\rceil 
but with f(x) general, not specifically a normal? If the ceil() complicates it, real numbers are fine too.

coolnessitself · Dec 22, 2009

Aside from what the name of those iterative distributions might be, here's another question.

I'm trying to actually find the distribution of \max(c/f(x)) so I can fit it, where f(x) is a N(\mu,\sigma^2). If I have more peaks I can just sum up my mus and sigmas, so this seems general enough.
So I would want to find the nth order statistic of c/f(x).
So X is a r.v. from this normal, and let Z=c/X.
 f_Z(z) = \frac{C}{z^2 \sqrt{2\pi\sigma^2}} \exp\left\{\frac{-1}{2\sigma^2 z^2}(C-z\mu)^2\right\} \;\;z\ne 0
 F_z(z) =\begin{cases} -\frac{1}{2}\left[1+\mathrm{erf}\left( \frac{C - z\mu}{z\sqrt{2\sigma^2}}\right)\right]&\text{if } z<0 \\ \frac{1}{2}&\text{if } z=0\\ \frac{1}{2}\left[1-\mathrm{erf}\left( \frac{C - z\mu}{z\sqrt{2\sigma^2}}\right)\right]&\text{if } z>0\end{cases} 
 f(z_{(n)}) = n f(z_{(n)}) \left[F(z_{(n)})\right]^{n-1} 
 E(z_{(n)}) = n \left(-\frac{1}{2}\right)^{n-1} \int\limits_{-\infty}^0 \frac{C}{z_{(n)} \sqrt{2\pi\sigma^2}} \exp\left\{\frac{-1}{2\sigma^2 z_{(n)}^2}(C-z\mu)^2\right\}\left[1+\mathrm{erf}\left( \frac{C - z_{(n)}\mu}{z_{(n)}\sqrt{2\sigma^2}}\right) \right]^{n-1} dz_{(n)} 
 \;\;\;\;\;+ n \left(\frac{1}{2}\right)^{n-1} \int\limits_{0}^\infty \frac{C}{z_{(n)} \sqrt{2\pi\sigma^2}} \exp\left\{\frac{-1}{2\sigma^2 z_{(n)}^2}(C-z\mu)^2\right\}\left[1-\mathrm{erf}\left( \frac{C - z_{(n)}\mu}{z_{(n)}\sqrt{2\sigma^2}}\right) \right]^{n-1} dz_{(n)} 
Let u=\frac{C - z_{(n)}\mu}{z_{(n)}\sqrt{2\sigma^2}}, so dz_{(n)}= \frac{-z_{(n)}^2\sqrt{2\sigma^2}}{C}du
 E(z_{(n)}) = -Cn \left(-\frac{1}{2}\right)^{n-1} \int\limits_{-\mu/2\sigma^2}^\infty \frac{1}{\sqrt{2\pi\sigma^2}u+\sqrt{\pi}\mu} \exp^{-u^2}\left[1+\mathrm{erf}\left( u\right) \right]^{n-1} du +Cn \left(\frac{1}{2}\right)^{n-1} \int\limits_{-\mu/2\sigma^2}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}u+\sqrt{\pi}\mu} \exp^{-u^2}\left[1-\mathrm{erf}\left( u\right) \right]^{n-1} du 
Any insight into this integral? :bugeye:

and finally, if the probabilities for a multinomial arrive from n normals, is there a distribution for the result? Or do you first have to calculate the normal probabilities and plug in?

Variably Spaced Data: Generating Integer Series with Minimal Spacing

Similar threads

Hot Threads

B A Little Probability Puzzle

I Need help solving this Existence Algorithm for truth

I Stochastic calculus: Ito's lemma and differentials

I Help me understand skewness in QQ-plots please

I Intransitive implication

Recent Insights

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect

Insights What Exactly is Dirac’s Delta Function? - Insight

Insights Relativator (Circular Slide-Rule): Simulated with Desmos - Insight

Insights Fixing Things Which Can Go Wrong With Complex Numbers

Insights Fermat's Last Theorem