Variably Spaced Data: Generating Integer Series with Minimal Spacing

  • Thread starter coolnessitself
  • Start date
  • Tags
    Data
In summary: No, this wouldn't work because x_2 would be larger than x_1 + \delta. You might be able to approximate this by generating x_2 using a different function than x_1, like a normal. Then control \delta with some distribution?
  • #1
coolnessitself
35
0
Hi all,

I'm looking to generate a series of integers in the range [1,1000], that start off with a large spacing, like 1 50 98, etc. At a peak of some distribution, they should be spaced with some minimal [tex]\Delta[/tex] spacing. For example if the peak is at 500 and [tex]\Delta=2[/tex], you might see ... 495, 498, 500, 502, 504, ... Then at the other end of the [1,1000] window, it's back to big spacing 950 999. I'd like the number of integers found in a sliding window of some size in this interval to follow some bell-shaped distribution, but with a constraint on the minimum spacing between numbers.

I get the feeling this is somehow related to a poisson process, so that's why I'm posting here, but that's about it.

Suggestions to bump me in the right direction?
 
Last edited:
Physics news on Phys.org
  • #2
If you have access to Minitab (or some other similar package with random number generators) you can specify the parameters for a random normal distribution including kurtosis and a minimal interval between numbers.
 
Last edited:
  • #3
But can you make that interval follow some distribution, but with a minimum interval? I have matlab, and I could easily obtain R, but I'm looking for some math behind it, really. Maybe this is more of an algorithmic thing than a statistics thing...
 
  • #4
coolnessitself said:
But can you make that interval follow some distribution, but with a minimum interval? I have matlab, and I could easily obtain R, but I'm looking for some math behind it, really. Maybe this is more of an algorithmic thing than a statistics thing...

I haven't used Minitab for a while, but I believe you can write a short program (algorithm) such that [tex]x_{i+1}-x_{i} \geq 2 [/tex] before invoking the random normal generator.

EDIT: If this can't be done, the OP can run this algorithm over the generated sequence to purge it of consecutive integers and interpolate if necessary. The OP did not specify a particular distribution, only a result.
 
Last edited:
  • #5
coolnessitself said:
Hi all,

I'm looking to generate a series of integers in the range [1,1000], that start off with a large spacing, like 1 50 98, etc. At a peak of some distribution, they should be spaced with some minimal [tex]\Delta[/tex] spacing. For example if the peak is at 500 and [tex]\Delta=2[/tex], you might see ... 495, 498, 500, 502, 504, ... Then at the other end of the [1,1000] window, it's back to big spacing 950 999. I'd like the number of integers found in a sliding window of some size in this interval to follow some bell-shaped distribution, but with a constraint on the minimum spacing between numbers.

I get the feeling this is somehow related to a poisson process, so that's why I'm posting here, but that's about it.

Suggestions to bump me in the right direction?

One way is to set [tex]x_{i+1}=x_i+\left\lceil c/f(x_i)\right\rceil[/tex] where f(x) is some density function - say it's normal, so choose mu=500 then choose c so that the gap is 2 in the middle then choose sigma to specify the gaps at x=1. If you want the gaps to be more random then simply add or multiply by a random number from some appropriate distribution.

HTH
 
  • #6
bpet said:
One way is to set [tex]x_{i+1}=x_i+\left\lceil c/f(x_i)\right\rceil[/tex] where f(x) is some density function - say it's normal, so choose mu=500 then choose c so that the gap is 2 in the middle then choose sigma to specify the gaps at x=1.

The OP is apparently thinking of a discrete distribution (only integers in this case) so f(x) is really a normal approximation. I don't understand how one chooses "c" to assure that the result is a non-consecutive integer.
 
Last edited:
  • #7
SW VandeCarr said:
I don't understand how one chooses "c" to assure that the result is a non-consecutive integer.

no it doesn't but the ceiling function [tex]\left\lceil\cdot\right\rceil[/tex] does :)
 
  • #8
cei-ing the c/f(x) does an adequate job for my purposes, so I'm set for now, but I guess I'm still curious about this problem in general.
A motivation: Given some bus station has buses arriving throughout the day (or here, seconds 1 to 1000). About noon (500), there are a lot of buses, whereas at midnight they're few and far between. I'm trying to generate these times, but the bus station has a capacity, so I have to have a minimum spacing between arrival times in my generated numbers.
So the ceil(c/f(x)) works for now. It's just tricky finding a sigma that works for some input window, peak, and capacity without trial and error...

As an alternative, would something like this work:
http://www.mathpages.com/home/kmath452.htm
with [tex]x_2 >x_1 + \delta[/tex], then control [tex]\delta[/tex] with some distribution?
 
Last edited:
  • #9
coolnessitself said:
A motivation: Given some bus station has buses arriving throughout the day (or here, seconds 1 to 1000). About noon (500), there are a lot of buses, whereas at midnight they're few and far between. I'm trying to generate these times, but the bus station has a capacity, so I have to have a minimum spacing between arrival times in my generated numbers.

At some places I've lived at the buses tended to arrive in groups - the first one crammed full and some empty ones stuck behind (not that that helps the model).

You might want to take other peak times into account, such as 8am & 6pm, so maybe a multipeaked or piecewise constant or spline shape might be more appropriate as the gap function - then instead of a nonlinear inverse density problem you've just got to choose some interpolation coefficients to match the shape of the desired gap rate function.
 
  • #10
I haven't gotten the [tex]x_2=x_1+\delta[/tex] to work yet, but I like your [tex]c/f(x)[/tex] suggestion since it can work for multiple peaks, as you say. I can get 2 equally-weighted peaks to work fairly well. I know this isn't a coding forum, but maybe this MATLAB can help someone
Code:
clc
maxVal=1000; %interval from 1 to 1000
minSpacing=2; %min spacing 2
maxSpacing=30; %max spacing 30
peakvec=[200 800]; %location of double peaks
weights=[1 1];

%build up a function to minimize
s='';
for i=1:length(peakvec)
   s=[s '+' num2str(weights(i)) '*normpdf([' num2str(1:maxVal) '],' num2str(peakvec(i)) ',sigmavec(' num2str(i) '))'];
end
s(1)=[];
func=inline(s,'sigmavec');
pdf = @(sigma) func(sigma)/max(func(sigma));
minpdf = @(sigma) abs(max(ceil(minSpacing./pdf(sigma))) - maxSpacing);
%minimizing minpdf should result in a pair of sigmas that keep maxSpaxing

sigma_0 = repmat(20,size(peakvec));
warning('off','MATLAB:divideByZero');
sigmavec = fminsearch(minpdf,sigma_0); %,repmat(0,size(peakvec)),[]) %the second part if you have fminsearchbnd
warning('on','MATLAB:divideByZero');
disp(['"Best" sigma chosen as ' num2str(sigmavec)]);

%build up a new pdf to generate x(i+1) from x(i)
s='';
for i=1:length(peakvec)
   s=[s '+' num2str(weights(i)) '*normpdf(x,' num2str(peakvec(i)) ',' num2str(sigmavec(i)) ')'];
end
s(1)=[];

func=inline(s,'x');
pdf = @(x) func(x)./max(func([1:maxVal]));

%plot what it looks like
figure
subplot(2,1,1);
plot([1:maxVal],pdf(1:maxVal));
title('distribution of buses');

%find the actual maxSpacing,minSpacing
maxseen=0; minseen=inf;
maxseenat=[]; minseenat=[];
x(1)=1;
i=1;
while x(i)<=maxVal
    x(i+1)=x(i)+ceil(minSpacing/pdf(x(i)));
    if x(i+1)-x(i)>maxseen
        maxseen=x(i+1)-x(i);
        maxseenat=[maxseenat; i];
    elseif x(i+1)-x(i) < minseen
        minseen =x(i+1)-x(i);
        minseenat=[minseenat; i];
    end
    i=i+1;
end
x(x>maxVal)=[];

disp(['Maximum spacing seen: ' num2str(maxseen) ' at position ' num2str(maxseenat')]);
disp(['Minimum spacing seen: ' num2str(minseen) ' at position ' num2str(minseenat')]);
subplot(2,1,2);
plot(x,1);
title('bus times')

Now back to stats. Add a third peak in the above peakvec and the distribution of bus times isn't even. Say I want n peaks, weighted with [tex]w_1, w_2, \ldots w_n[/tex]. What would my f(x) look like? If I just sum them as [tex]w_1N(\mu_1,\sigma_1) + ... + w_nN(\mu_n,\sigma_n)[/tex], my least squares business above doesn't work. Peaks are arbitrarily scaled as far as I can tell. Do I have to deal with covariance matrices somehow to make the amount of buses arriving in a peak time weighted as I wish? Also, this is insufficient for large intervals (like 60*60*24 seconds instead of 1000)

ps. to be honest this has nothing to do with bus times but it's a fun analogy :]
 
Last edited:
  • #11
You might like to try the Shape Langauge Modeling download on Matlab Central, which might give you a simpler way of fitting the gap function. In particular there's no need to restrict yourself to inverted probability densities. Also avoid using strings/inlines to define functions - in your case it should be possible with something like @(x)sum(arrayfun(...)).

HTH
 
  • #12
I think I would need something like cellfun, but my version of MATLAB doesn't allow anonymous functions in cellfun, just built-in functions, so that won't work. But anyways, that's all fine and good.

Followup question:
Is there a term or subject or something I can look into to describe the distribution of an iterative sequence like the one you offered?
[tex]
x_{i+1}=x_i+\left\lceil c/f(x_i)\right\rceil
[/tex]
but with f(x) general, not specifically a normal? If the ceil() complicates it, real numbers are fine too.
 
  • #13
Aside from what the name of those iterative distributions might be, here's another question.

I'm trying to actually find the distribution of [tex]\max(c/f(x))[/tex] so I can fit it, where f(x) is a [tex]N(\mu,\sigma^2)[/tex]. If I have more peaks I can just sum up my mus and sigmas, so this seems general enough.
So I would want to find the nth order statistic of c/f(x).
So X is a r.v. from this normal, and let Z=c/X.
[tex]
f_Z(z) = \frac{C}{z^2 \sqrt{2\pi\sigma^2}} \exp\left\{\frac{-1}{2\sigma^2 z^2}(C-z\mu)^2\right\} \;\;z\ne 0 [/tex]
[tex]
F_z(z) =\begin{cases}
-\frac{1}{2}\left[1+\mathrm{erf}\left( \frac{C - z\mu}{z\sqrt{2\sigma^2}}\right)\right]&\text{if } z<0 \\
\frac{1}{2}&\text{if } z=0\\
\frac{1}{2}\left[1-\mathrm{erf}\left( \frac{C - z\mu}{z\sqrt{2\sigma^2}}\right)\right]&\text{if } z>0\end{cases}
[/tex]
[tex]
f(z_{(n)}) = n f(z_{(n)}) \left[F(z_{(n)})\right]^{n-1}
[/tex]
[tex]
E(z_{(n)}) = n \left(-\frac{1}{2}\right)^{n-1} \int\limits_{-\infty}^0 \frac{C}{z_{(n)} \sqrt{2\pi\sigma^2}} \exp\left\{\frac{-1}{2\sigma^2 z_{(n)}^2}(C-z\mu)^2\right\}\left[1+\mathrm{erf}\left( \frac{C - z_{(n)}\mu}{z_{(n)}\sqrt{2\sigma^2}}\right) \right]^{n-1} dz_{(n)}
[/tex]
[tex]
\;\;\;\;\;+ n \left(\frac{1}{2}\right)^{n-1} \int\limits_{0}^\infty \frac{C}{z_{(n)} \sqrt{2\pi\sigma^2}} \exp\left\{\frac{-1}{2\sigma^2 z_{(n)}^2}(C-z\mu)^2\right\}\left[1-\mathrm{erf}\left( \frac{C - z_{(n)}\mu}{z_{(n)}\sqrt{2\sigma^2}}\right) \right]^{n-1} dz_{(n)}
[/tex]
Let [tex]u=\frac{C - z_{(n)}\mu}{z_{(n)}\sqrt{2\sigma^2}}[/tex], so [tex] dz_{(n)}= \frac{-z_{(n)}^2\sqrt{2\sigma^2}}{C}du[/tex]
[tex]
E(z_{(n)}) = -Cn \left(-\frac{1}{2}\right)^{n-1} \int\limits_{-\mu/2\sigma^2}^\infty \frac{1}{\sqrt{2\pi\sigma^2}u+\sqrt{\pi}\mu} \exp^{-u^2}\left[1+\mathrm{erf}\left( u\right) \right]^{n-1} du
+Cn \left(\frac{1}{2}\right)^{n-1} \int\limits_{-\mu/2\sigma^2}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}u+\sqrt{\pi}\mu} \exp^{-u^2}\left[1-\mathrm{erf}\left( u\right) \right]^{n-1} du
[/tex]
Any insight into this integral? :bugeye::rolleyes:

and finally, if the probabilities for a multinomial arrive from n normals, is there a distribution for the result? Or do you first have to calculate the normal probabilities and plug in?
 
Last edited:

1. What is variably spaced data?

Variably spaced data refers to a set of data points that have different spacing between them. This can happen when the data is collected over a period of time, or when the data is not collected at regular intervals.

2. Why is it important to generate integer series with minimal spacing?

Generating integer series with minimal spacing allows for more accurate analysis and interpretation of the data. It also helps to reduce the amount of noise and outliers in the data, making it easier to identify patterns and trends.

3. What are the potential applications of variably spaced data?

Variably spaced data can be used in a variety of fields, such as finance, biology, and climate science. It can be used to analyze stock market trends, track changes in animal populations, and study climate patterns over time.

4. How can variably spaced data be generated with minimal spacing?

There are several methods for generating integer series with minimal spacing, including linear interpolation, cubic spline interpolation, and nearest neighbor interpolation. These methods use mathematical algorithms to fill in the missing data points with values that closely resemble the existing data.

5. What are some challenges associated with working with variably spaced data?

One challenge is that the data may not be evenly distributed, making it difficult to perform certain statistical analyses. Another challenge is that the data may contain outliers or missing data, which can skew the results of the analysis. It is important to carefully consider and address these challenges when working with variably spaced data.

Similar threads

  • Sci-Fi Writing and World Building
Replies
7
Views
1K
  • Set Theory, Logic, Probability, Statistics
Replies
8
Views
1K
  • Sci-Fi Writing and World Building
3
Replies
96
Views
6K
  • Programming and Computer Science
Replies
3
Views
2K
  • STEM Academic Advising
Replies
13
Views
2K
  • Set Theory, Logic, Probability, Statistics
Replies
1
Views
2K
Replies
10
Views
2K
  • STEM Academic Advising
Replies
4
Views
2K
Replies
1
Views
1K
  • Poll
  • Science and Math Textbooks
Replies
15
Views
15K
Back
Top