Variably Spaced Data: Generating Integer Series with Minimal Spacing

coolnessitself · Dec 18, 2009

Hi all,

I'm looking to generate a series of integers in the range [1,1000], that start off with a large spacing, like 1 50 98, etc. At a peak of some distribution, they should be spaced with some minimal [tex]\Delta[/tex] spacing. For example if the peak is at 500 and [tex]\Delta=2[/tex], you might see ... 495, 498, 500, 502, 504, ... Then at the other end of the [1,1000] window, it's back to big spacing 950 999. I'd like the number of integers found in a sliding window of some size in this interval to follow some bell-shaped distribution, but with a constraint on the minimum spacing between numbers.

I get the feeling this is somehow related to a poisson process, so that's why I'm posting here, but that's about it.

Suggestions to bump me in the right direction?

SW VandeCarr · Dec 18, 2009

If you have access to Minitab (or some other similar package with random number generators) you can specify the parameters for a random normal distribution including kurtosis and a minimal interval between numbers.

coolnessitself · Dec 18, 2009

But can you make that interval follow some distribution, but with a minimum interval? I have matlab, and I could easily obtain R, but I'm looking for some math behind it, really. Maybe this is more of an algorithmic thing than a statistics thing...

SW VandeCarr · Dec 18, 2009

coolnessitself said:

But can you make that interval follow some distribution, but with a minimum interval? I have matlab, and I could easily obtain R, but I'm looking for some math behind it, really. Maybe this is more of an algorithmic thing than a statistics thing...

I haven't used Minitab for a while, but I believe you can write a short program (algorithm) such that [tex]x_{i+1}-x_{i} \geq 2 [/tex] before invoking the random normal generator.

EDIT: If this can't be done, the OP can run this algorithm over the generated sequence to purge it of consecutive integers and interpolate if necessary. The OP did not specify a particular distribution, only a result.

bpet · Dec 18, 2009

coolnessitself said:

Hi all,

I'm looking to generate a series of integers in the range [1,1000], that start off with a large spacing, like 1 50 98, etc. At a peak of some distribution, they should be spaced with some minimal [tex]\Delta[/tex] spacing. For example if the peak is at 500 and [tex]\Delta=2[/tex], you might see ... 495, 498, 500, 502, 504, ... Then at the other end of the [1,1000] window, it's back to big spacing 950 999. I'd like the number of integers found in a sliding window of some size in this interval to follow some bell-shaped distribution, but with a constraint on the minimum spacing between numbers.

I get the feeling this is somehow related to a poisson process, so that's why I'm posting here, but that's about it.

Suggestions to bump me in the right direction?

One way is to set [tex]x_{i+1}=x_i+\left\lceil c/f(x_i)\right\rceil[/tex] where f(x) is some density function - say it's normal, so choose mu=500 then choose c so that the gap is 2 in the middle then choose sigma to specify the gaps at x=1. If you want the gaps to be more random then simply add or multiply by a random number from some appropriate distribution.

HTH

SW VandeCarr · Dec 18, 2009

bpet said:

One way is to set [tex]x_{i+1}=x_i+\left\lceil c/f(x_i)\right\rceil[/tex] where f(x) is some density function - say it's normal, so choose mu=500 then choose c so that the gap is 2 in the middle then choose sigma to specify the gaps at x=1.

The OP is apparently thinking of a discrete distribution (only integers in this case) so f(x) is really a normal approximation. I don't understand how one chooses "c" to assure that the result is a non-consecutive integer.

bpet · Dec 18, 2009

SW VandeCarr said:

I don't understand how one chooses "c" to assure that the result is a non-consecutive integer.

no it doesn't but the ceiling function [tex]\left\lceil\cdot\right\rceil[/tex] does :)

coolnessitself · Dec 18, 2009

cei-ing the c/f(x) does an adequate job for my purposes, so I'm set for now, but I guess I'm still curious about this problem in general.
A motivation: Given some bus station has buses arriving throughout the day (or here, seconds 1 to 1000). About noon (500), there are a lot of buses, whereas at midnight they're few and far between. I'm trying to generate these times, but the bus station has a capacity, so I have to have a minimum spacing between arrival times in my generated numbers.
So the ceil(c/f(x)) works for now. It's just tricky finding a sigma that works for some input window, peak, and capacity without trial and error...

As an alternative, would something like this work:
http://www.mathpages.com/home/kmath452.htm
with [tex]x_2 >x_1 + \delta[/tex], then control [tex]\delta[/tex] with some distribution?

bpet · Dec 18, 2009

coolnessitself said:

A motivation: Given some bus station has buses arriving throughout the day (or here, seconds 1 to 1000). About noon (500), there are a lot of buses, whereas at midnight they're few and far between. I'm trying to generate these times, but the bus station has a capacity, so I have to have a minimum spacing between arrival times in my generated numbers.

At some places I've lived at the buses tended to arrive in groups - the first one crammed full and some empty ones stuck behind (not that that helps the model).

You might want to take other peak times into account, such as 8am & 6pm, so maybe a multipeaked or piecewise constant or spline shape might be more appropriate as the gap function - then instead of a nonlinear inverse density problem you've just got to choose some interpolation coefficients to match the shape of the desired gap rate function.

coolnessitself · Dec 19, 2009

I haven't gotten the [tex]x_2=x_1+\delta[/tex] to work yet, but I like your [tex]c/f(x)[/tex] suggestion since it can work for multiple peaks, as you say. I can get 2 equally-weighted peaks to work fairly well. I know this isn't a coding forum, but maybe this MATLAB can help someone

Code:

clc
maxVal=1000; %interval from 1 to 1000
minSpacing=2; %min spacing 2
maxSpacing=30; %max spacing 30
peakvec=[200 800]; %location of double peaks
weights=[1 1];

%build up a function to minimize
s='';
for i=1:length(peakvec)
   s=[s '+' num2str(weights(i)) '*normpdf([' num2str(1:maxVal) '],' num2str(peakvec(i)) ',sigmavec(' num2str(i) '))'];
end
s(1)=[];
func=inline(s,'sigmavec');
pdf = @(sigma) func(sigma)/max(func(sigma));
minpdf = @(sigma) abs(max(ceil(minSpacing./pdf(sigma))) - maxSpacing);
%minimizing minpdf should result in a pair of sigmas that keep maxSpaxing

sigma_0 = repmat(20,size(peakvec));
warning('off','MATLAB:divideByZero');
sigmavec = fminsearch(minpdf,sigma_0); %,repmat(0,size(peakvec)),[]) %the second part if you have fminsearchbnd
warning('on','MATLAB:divideByZero');
disp(['"Best" sigma chosen as ' num2str(sigmavec)]);

%build up a new pdf to generate x(i+1) from x(i)
s='';
for i=1:length(peakvec)
   s=[s '+' num2str(weights(i)) '*normpdf(x,' num2str(peakvec(i)) ',' num2str(sigmavec(i)) ')'];
end
s(1)=[];

func=inline(s,'x');
pdf = @(x) func(x)./max(func([1:maxVal]));

%plot what it looks like
figure
subplot(2,1,1);
plot([1:maxVal],pdf(1:maxVal));
title('distribution of buses');

%find the actual maxSpacing,minSpacing
maxseen=0; minseen=inf;
maxseenat=[]; minseenat=[];
x(1)=1;
i=1;
while x(i)<=maxVal
    x(i+1)=x(i)+ceil(minSpacing/pdf(x(i)));
    if x(i+1)-x(i)>maxseen
        maxseen=x(i+1)-x(i);
        maxseenat=[maxseenat; i];
    elseif x(i+1)-x(i) < minseen
        minseen =x(i+1)-x(i);
        minseenat=[minseenat; i];
    end
    i=i+1;
end
x(x>maxVal)=[];

disp(['Maximum spacing seen: ' num2str(maxseen) ' at position ' num2str(maxseenat')]);
disp(['Minimum spacing seen: ' num2str(minseen) ' at position ' num2str(minseenat')]);
subplot(2,1,2);
plot(x,1);
title('bus times')

Now back to stats. Add a third peak in the above peakvec and the distribution of bus times isn't even. Say I want n peaks, weighted with [tex]w_1, w_2, \ldots w_n[/tex]. What would my f(x) look like? If I just sum them as [tex]w_1N(\mu_1,\sigma_1) + ... + w_nN(\mu_n,\sigma_n)[/tex], my least squares business above doesn't work. Peaks are arbitrarily scaled as far as I can tell. Do I have to deal with covariance matrices somehow to make the amount of buses arriving in a peak time weighted as I wish? Also, this is insufficient for large intervals (like 60*60*24 seconds instead of 1000)

ps. to be honest this has nothing to do with bus times but it's a fun analogy :]

bpet · Dec 20, 2009

You might like to try the Shape Langauge Modeling download on Matlab Central, which might give you a simpler way of fitting the gap function. In particular there's no need to restrict yourself to inverted probability densities. Also avoid using strings/inlines to define functions - in your case it should be possible with something like @(x)sum(arrayfun(...)).

HTH

coolnessitself · Dec 22, 2009

I think I would need something like cellfun, but my version of MATLAB doesn't allow anonymous functions in cellfun, just built-in functions, so that won't work. But anyways, that's all fine and good.

Followup question:
Is there a term or subject or something I can look into to describe the distribution of an iterative sequence like the one you offered?
[tex]
x_{i+1}=x_i+\left\lceil c/f(x_i)\right\rceil
[/tex]
but with f(x) general, not specifically a normal? If the ceil() complicates it, real numbers are fine too.

coolnessitself · Dec 22, 2009

Aside from what the name of those iterative distributions might be, here's another question.

I'm trying to actually find the distribution of [tex]\max(c/f(x))[/tex] so I can fit it, where f(x) is a [tex]N(\mu,\sigma^2)[/tex]. If I have more peaks I can just sum up my mus and sigmas, so this seems general enough.
So I would want to find the nth order statistic of c/f(x).
So X is a r.v. from this normal, and let Z=c/X.
[tex]
f_Z(z) = \frac{C}{z^2 \sqrt{2\pi\sigma^2}} \exp\left\{\frac{-1}{2\sigma^2 z^2}(C-z\mu)^2\right\} \;\;z\ne 0 [/tex]
[tex]
F_z(z) =\begin{cases}
-\frac{1}{2}\left[1+\mathrm{erf}\left( \frac{C - z\mu}{z\sqrt{2\sigma^2}}\right)\right]&\text{if } z<0 \\
\frac{1}{2}&\text{if } z=0\\
\frac{1}{2}\left[1-\mathrm{erf}\left( \frac{C - z\mu}{z\sqrt{2\sigma^2}}\right)\right]&\text{if } z>0\end{cases}
[/tex]
[tex]
f(z_{(n)}) = n f(z_{(n)}) \left[F(z_{(n)})\right]^{n-1}
[/tex]
[tex]
E(z_{(n)}) = n \left(-\frac{1}{2}\right)^{n-1} \int\limits_{-\infty}^0 \frac{C}{z_{(n)} \sqrt{2\pi\sigma^2}} \exp\left\{\frac{-1}{2\sigma^2 z_{(n)}^2}(C-z\mu)^2\right\}\left[1+\mathrm{erf}\left( \frac{C - z_{(n)}\mu}{z_{(n)}\sqrt{2\sigma^2}}\right) \right]^{n-1} dz_{(n)}
[/tex]
[tex]
\;\;\;\;\;+ n \left(\frac{1}{2}\right)^{n-1} \int\limits_{0}^\infty \frac{C}{z_{(n)} \sqrt{2\pi\sigma^2}} \exp\left\{\frac{-1}{2\sigma^2 z_{(n)}^2}(C-z\mu)^2\right\}\left[1-\mathrm{erf}\left( \frac{C - z_{(n)}\mu}{z_{(n)}\sqrt{2\sigma^2}}\right) \right]^{n-1} dz_{(n)}
[/tex]
Let [tex]u=\frac{C - z_{(n)}\mu}{z_{(n)}\sqrt{2\sigma^2}}[/tex], so [tex] dz_{(n)}= \frac{-z_{(n)}^2\sqrt{2\sigma^2}}{C}du[/tex]
[tex]
E(z_{(n)}) = -Cn \left(-\frac{1}{2}\right)^{n-1} \int\limits_{-\mu/2\sigma^2}^\infty \frac{1}{\sqrt{2\pi\sigma^2}u+\sqrt{\pi}\mu} \exp^{-u^2}\left[1+\mathrm{erf}\left( u\right) \right]^{n-1} du
+Cn \left(\frac{1}{2}\right)^{n-1} \int\limits_{-\mu/2\sigma^2}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}u+\sqrt{\pi}\mu} \exp^{-u^2}\left[1-\mathrm{erf}\left( u\right) \right]^{n-1} du
[/tex]
Any insight into this integral? :bugeye:

and finally, if the probabilities for a multinomial arrive from n normals, is there a distribution for the result? Or do you first have to calculate the normal probabilities and plug in?

Variably Spaced Data: Generating Integer Series with Minimal Spacing

1. What is variably spaced data?

2. Why is it important to generate integer series with minimal spacing?

3. What are the potential applications of variably spaced data?

4. How can variably spaced data be generated with minimal spacing?

5. What are some challenges associated with working with variably spaced data?

Similar threads

Hot Threads

Recent Insights