# Algebraic sum of continuous random variable probabilities

#### Revvie32

Hi all,

I have a question relating to the sum of continuous random variable probabilities that I hope you can help to answer.

In any probability density function (pdf), dealing with discrete or continuous random variables, the sum of the probabilities of all possible events must equal 1. This stands to reason so no difficulty here.....

Consider, for example, the pdf of a continuous random variable, Y, such as the maximum height of a wave at a constant point in the ocean. If we take n measurements (say, for argument's sake, one measurement every 24 hours for 1 year, giving us 365 measurements), then the probability of encountering a maximum wave height of exactly Y in any given 24 hour period is calculated as: (number of occurrences of wave height equal to Y) ÷ 365, from which we can construct our pdf. (Please correct me if I'm incorrect on this.)

Now, if we sum all of the recorded probabilities of Y, this will equal 1. However, if we sum probabilities read from the graph of the pdf at smaller and smaller intervals between values of Y, we very quickly reach a stage where this sum exceeds 1, which clearly cannot be the case.

For the example used, I'm only looking at a purely theoretical analysis, i.e. meteorological/ocean processes and any other complications such as independence of wave events can be completely ignored for this answer.

Can anyone explain how this can be please? Much appreciated.


#### micromass

Hi Revvie32!

I have a question relating to the sum of continuous random variable probabilities that I hope you can help to answer.

In any probability density function (pdf), dealing with discrete or continuous random variables, the sum of the probabilities of all possible events must equal 1. This stands to reason so no difficulty here.....
Two remarks. You mean "outcome" instead of "event" here. An event is a set of outcomes, and the sum of the probabilities of all events is not necessarily 1.
So what you meant to say is that the sum of the probabilities of all outcomes must equal 1. This is actually only true for discrete random variables. The sum over all outcomes for a continuous variable doesn't really make sense (not in the way we want it, at least). For a continuous variable, we demand

$$\int_\mathbb{R}{f(x)\,dx}=1$$

thus the sum is replaced by the integral.
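To see this numerically, here is a small Python sketch (the standard normal is used as a stand-in density, and the integration range is an assumption made for illustration) confirming that the integral comes out to 1:

```python
import math

# Standard normal pdf, used here as a concrete example of a continuous density.
def f(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

# Trapezoidal rule over [-10, 10]; the tails beyond this range are negligible.
a, b, n = -10.0, 10.0, 100_000
h = (b - a) / n
total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
total *= h

print(total)  # very close to 1
```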

Consider, for example, the pdf of a continuous random variable, Y, such as the maximum height of a wave at a constant point in the ocean. If we take n measurements (say, for argument's sake, one measurement every 24 hours for 1 year, giving us 365 measurements), then the probability of encountering a maximum wave height of exactly Y in any given 24 hour period is calculated as: (number of occurrences of wave height equal to Y) ÷ 365, from which we can construct our pdf. (Please correct me if I'm incorrect on this.)
What you are describing here is non-parametric estimation (something I don't know a lot about). But the pdf you obtain will be that of a discrete random variable. For example, if you have three measurements 0, 0, 1, then you obtain the pdf which has value 2/3 at 0 and value 1/3 at 1, which is thus discrete.

If you want, you can find a continuous distribution which approximates the discrete pdf you obtain, but I know nothing about that.
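For what it's worth, the empirical construction described above can be sketched in a few lines of Python (using the three measurements 0, 0, 1 from the example):

```python
from collections import Counter

measurements = [0, 0, 1]  # the example data
n = len(measurements)

# Empirical distribution: each observed value mapped to its relative frequency.
pmf = {value: count / n for value, count in Counter(measurements).items()}

print(pmf)  # value 0 has probability 2/3, value 1 has probability 1/3
```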

#### Revvie32

Cheers micromass!

That's all great, thanks. What I've actually done is create a histogram of ranges (0-0.99m, 1-1.99m, 2-2.99m etc.) of wave heights and the number of measurements recorded in each range. This can easily become a pdf for a discrete random variable when I convert the number of recordings per year to probabilities, i.e. 3 recordings per year gives a probability of (3/365), which would be useful for the engineering application I'm looking at.
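That count-to-probability step can be sketched in Python; the bin counts below are hypothetical, invented purely to illustrate the conversion:

```python
# Hypothetical numbers of days whose maximum wave height fell in each 1 m bin,
# over a 365-day year (these counts are made up for illustration).
counts_per_bin = {
    "0-0.99m": 120,
    "1-1.99m": 150,
    "2-2.99m": 80,
    "3-3.99m": 12,
    "4-4.99m": 3,
}

n_days = sum(counts_per_bin.values())  # 365

# Convert counts to empirical probabilities, e.g. 3 recordings -> 3/365.
probabilities = {b: c / n_days for b, c in counts_per_bin.items()}

# The bin probabilities sum to 1, as they must.
print(probabilities["4-4.99m"])  # 3/365, about 0.0082
```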

My only remaining question is: when you say the pdf has a value of 1/3 in your example of 3 measurements, is that a probability of 1/3 of occurring within the time interval between measurements (i.e. 24 hours), or at any given time in a 48 hour period (the time between the 1st and 3rd measurements, assuming they were taken at the same time each day)?

Perhaps I'm getting into non-parametric estimation too much here as you say it's not your strong point but if you or anyone else has an answer it would be very much appreciated!

#### Stephen Tashi

Consider, for example, the pdf of a continuous random variable, Y, such as the maximum height of a wave at a constant point in the ocean. If we take n measurements (say, for argument's sake, one measurement every 24 hours for 1 year, giving us 365 measurements), then the probability of encountering a maximum wave height of exactly Y in any given 24 hour period is calculated as: (number of occurrences of wave height equal to Y) ÷ 365, from which we can construct our pdf. (Please correct me if I'm incorrect on this.)
You need to make your terminology more precise. You speak of the "maximum" height of a wave at a fixed point on the ocean measured at times every 24 hours. If we make a measurement exactly every 24 hours, then we measure wave height at a "point" in time, so I don't understand why the term "maximum" is apt. If you were to measure the wave height continuously over a 1 hour interval, then it would make sense to speak of the maximum recorded wave height in that interval.

Most importantly, you are not making any distinction between the things that make up a probability model (pdfs) and the process and quantities involved in estimating that model from empirical data. There is no law of probability that says that a histogram of data is the pdf of the random variable being measured. You can say that your procedure for estimating a pdf will be to use a histogram of data if you want. But then you must construct your histogram in such a way that it makes a pdf and not some function that doesn't integrate to 1. If you don't, then it is your method of estimation that is defective, not the theory of pdfs.

The height of an ocean wave at some fixed point isn't a good prospect for modeling as a random variable unless you only care about the height at widely separated intervals of time. Over small intervals of time, the heights of the wave will be closely related. I think a better model for ocean wave height would be a continuous time stochastic process.

#### Revvie32

Hi Stephen,

those are some fair points, and actually things I hadn't considered previously, so thanks a lot for pointing them out.

Forgetting completely about the example I used above, and bearing in mind your comments, the more apt question, on which I would appreciate your advice or that of another expert, is:

Procedure:
I measure the height of every single wave at a fixed point in the ocean on every day for exactly 365 days (or 24 x 365=8760 hours). From these wave height readings I tabulate the maximum wave height, which is the continuous random variable, X, on each day, for 365 days, logically giving me 365 values of X.

If I then find that X was between 5m and 6m on 50 days in the entire year, the probability of the maximum wave height being between 5m and 6m, by my reckoning will be P(5<X<6)=(50÷365)=0.137 approximately. I figure that this will be the probability of the maximum wave height being between 5m and 6m in any given 24 hour period in the year in which the data was recorded, rather than any other period of time. Can someone just confirm or correct this absolutely?

I'm fully aware of the simplifications of the above example, i.e. that I'm using empirical measurements to estimate probability rather than having an infinite no. of values over an infinite timeframe, that wave heights on one day are not at all independent from those on the previous day, that instruments don't always record 100% accurately etc. IGNORING THESE SIMPLIFICATIONS, SIMPLY FOR THE PURPOSE OF GETTING THIS THEORY CORRECT, CAN SOMEONE PLEASE ADVISE AS TO MY QUERY?

Many thanks.

#### Stephen Tashi

I figure that this will be the probability of the maximum wave height being between 5m and 6m in any given 24 hour period in the year in which the data was recorded, rather than any other period of time.
A critical question here is whether "the maximum wave height" means the maximum height as taken from a set of discrete measurements (e.g. measurements every hour, each measurement taken at a point in time) or whether you are thinking of wave height as a continuously varying quantity and trying to determine if it is between 5 and 6 meters throughout a given day or given time interval.

I gather you are focused on a random variable model of this problem. You are willing to assume that each measurement is an independent realization of the random variable X. Let's look first at the question of a max of discrete measurements. If your empirical data shows that 24 consecutive random draws of X produce a value between 5 and 6 with a frequency of 0.137 then, yes, it is a permissible method of estimation to say that any 24 independent realizations of X (consecutive or not) have probability 0.137 of producing a value between 5 and 6.

However, this is not the only way to estimate the probability of X being between 5 and 6 in 24 independent draws. For example, if you think about how people estimate a pdf from data, they do things like compute the mean and variance of sets of data and then use a pdf from a family of pdfs (such as normal distributions or lognormal distributions etc.) which has a mean and variance that match the sample data. This method "uses all the data" instead of relying only on where the maximum fell. If you estimate the pdf from the mean and variance of the data, you can use the pdf to compute the probability of X falling between 5 and 6 in 24 independent draws. You might get a different number than 0.137, even if that is the frequency observed in the data you used to estimate the pdf.
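A rough Python sketch of that moment-matching idea, using only the standard library; the sample is simulated from an assumed normal process purely for illustration, not real wave data:

```python
import math
import random

random.seed(0)

# Simulated stand-in for 365 daily maximum wave heights (metres);
# the underlying distribution is an assumption made for this illustration.
sample = [random.gauss(4.0, 1.5) for _ in range(365)]

# Sample mean and (unbiased) sample variance.
mean = sum(sample) / len(sample)
var = sum((x - mean) ** 2 for x in sample) / (len(sample) - 1)
sd = math.sqrt(var)

def normal_cdf(x, mu, sigma):
    # Normal cumulative distribution function via the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Moment-matched normal estimate of P(5 < X < 6).
p = normal_cdf(6.0, mean, sd) - normal_cdf(5.0, mean, sd)
print(p)
```

In general this estimate will differ from the raw observed frequency, which is exactly the point being made above.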

As to which method of estimation is "better", this has no mathematical answer unless you are willing to supply specifics. You must define quantitatively how to measure whether one method is "better" than another, and you may have to specify probability distributions for "prior" information, as needed by Bayesian statistics.

Now, let's consider the wave height X in some time interval and imagine that at each instant of time there is a random realization of X. If you think about this, you'll see that the maximum value of X in any time interval will always be the maximum possible value of X (or the max will be infinity if the pdf of X has no upper limit). This is because, in any time interval, we must imagine an infinite number of realizations of X. For example, if a random variable has a small probability of taking on the value 73 and you are allowed to take an infinite number of realizations of it, you will eventually get a realization of it that is 73.

It is natural to want to extend the theory of random variables to each point in an interval of time in some way that produces a more pleasing result. To do this, you can't use the same pdf for hourly measurements as you do for measurements every minute, or every second etc. So you must think of a family of random variables X_t where t is some specified time interval between measurements. Following that strategy you get to things like "stationary random functions" or "brownian motion".

#### bpet

...
I measure the height of every single wave at a fixed point in the ocean on every day for exactly 365 days (or 24 x 365=8760 hours). From these wave height readings I tabulate the maximum wave height, which is the continuous random variable, X, on each day, for 365 days, logically giving me 365 values of X.

If I then find that X was between 5m and 6m on 50 days in the entire year, the probability of the maximum wave height being between 5m and 6m, by my reckoning will be P(5<X<6)=(50÷365)=0.137 approximately. I figure that this will be the probability of the maximum wave height being between 5m and 6m in any given 24 hour period in the year in which the data was recorded, rather than any other period of time. Can someone just confirm or correct this absolutely?
...
From a theoretical point of view, 50/365 will be a reasonable estimate of the probability of the wave height being between 5m and 6m even if there is some correlation between the wave heights on successive days (some versions of the CLT allow weak dependence).

However a bigger issue is that the 24-hourly measurements are not capturing the absolute peaks of each day, and it would be quite a challenge to estimate exactly how much the peak is being underestimated.

#### Revvie32

From a theoretical point of view, 50/365 will be a reasonable estimate of the probability of the wave height being between 5m and 6m even if there is some correlation between the wave heights on successive days (some versions of the CLT allow weak dependence).

However a bigger issue is that the 24-hourly measurements are not capturing the absolute peaks of each day, and it would be quite a challenge to estimate exactly how much the peak is being underestimated.
Thanks guys, this is getting very close to the answer I've been searching for. It seems to me that there are clarifications needed here, one from me, and hopefully one from you guys.

Firstly, as regards the issue of not capturing the absolute peak, my suggested procedure was to measure every single wave at a constant point throughout the entire 365 days, as opposed to a single measurement every second, minute, hour or day. Correct me if I'm wrong, but surely this will capture the absolute peak at that constant point on each day throughout the entire year?

Secondly, I will assume for the moment that the issue of capturing the absolute peak on every day is resolved by measuring every single wave height throughout the year. This being the case, is (50/365=0.137) the probability of the maximum wave height in any given 24 hour period having a value between 5m and 6m? I say 24 hour period as 24 hours=1 day, and I am using units of days (365 of them) to compute this probability.

As such would the probability of the maximum wave height in any given 48 hour period having a value between 5m and 6m be equal to (2 x 0.137)=0.274?

#### Stephen Tashi

Firstly, as regards the issue of not capturing the absolute peak, my suggested procedure was to measure every single wave at a constant point throughout the entire 365 days, as opposed to a single measurement every second, minute, hour or day.
Apparently by "measuring a wave", you mean that you make a nearly continuous measurement of the water level as it rises and falls, and you record each peak.

Secondly, I will assume for the moment that the issue of capturing the absolute peak on every day is resolved by measuring every single wave height throughout the year. This being the case, is (50/365=0.137) the probability of the maximum wave height in any given 24 hour period having a value between 5m and 6m? I say 24 hour period as 24 hours=1 day, and I am using units of days (365 of them) to compute this probability.
Observed frequencies are not probabilities, but let's assume 0.137 is correct.

As such would the probability of the maximum wave height in any given 48 hour period having a value between 5m and 6m be equal to (2 x 0.137)=0.274?
No, not if we are assuming the max heights in the two 24 hour intervals are independent events. Under the assumption of independence the probability would be the product (0.137)(0.137).

#### bpet

...As such would the probability of the maximum wave height in any given 48 hour period having a value between 5m and 6m be equal to (2 x 0.137)=0.274?
No. To work out the distribution of a maximum, first work out the cumulative probabilities. Suppose there is a probability of 0.6 that the max wave height over 24 hours is 5m or less, so the probability of the 24h max being less than 6m is (0.6 + 0.137). Assuming independence between days, P[48h max <= 6m] = P[1st 24h max <= 6m and 2nd 24h max <= 6m] = P[single 24h max <= 6m]^2 = (0.6 + 0.137)^2. Similarly P[48h max <= 5m] = 0.6^2, so P[5m < 48h max <= 6m] = (0.6 + 0.137)^2 - 0.6^2 = 0.183 (approx).
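Spelled out as a small Python check (the 0.6 figure is the assumed cumulative probability from the example above):

```python
# Assumed single-day cumulative probabilities from the example.
p_le_5 = 0.6                # P(24h max <= 5m), assumed for illustration
p_5_to_6 = 0.137            # P(5m < 24h max <= 6m), the thread's estimate
p_le_6 = p_le_5 + p_5_to_6  # P(24h max <= 6m)

# Assuming independent days, the cdf of the 48h max is the 24h cdf squared.
p_48h_5_to_6 = p_le_6 ** 2 - p_le_5 ** 2

print(round(p_48h_5_to_6, 3))  # 0.183
```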

#### Stephen Tashi

No, to work out the distribution of a maximum first work out the cumulative probabilities.
It's generally safer to work with the cumulative distribution, but I don't see how to do it in this case. If X1 is the max height of a wave in the first 24 hour period and X2 is the max height of the wave in the second 24 hour period then asserting that:

Pr( X1 and X2 both between 5 and 6) = Pr(both X1 and X2 < 6) - Pr(both X1 and X2 < 5)
would imply that:
Pr(X1 and X2 both between 5 and 6) + Pr(both X1 and X2 < 5) = Pr(both X1 and X2 < 6)
But it is possible for both X1 and X2 to be < 6 without both of them being between 5 and 6 or both being less than 5. For example, X1 might be 4 and X2 might be 5.5.

#### bpet

It's generally safer to work with the cumulative distribution, but I don't see how to do it in this case. If X1 is the max height of a wave in the first 24 hour period and X2 is the max height of the wave in the second 24 hour period then asserting that:

Pr( X1 and X2 both between 5 and 6) = Pr(both X1 and X2 < 6) - Pr(both X1 and X2 < 5)
would imply that:
Pr(X1 and X2 both between 5 and 6) + Pr(both X1 and X2 < 5) = Pr(both X1 and X2 < 6)
But it is possible for both X1 and X2 to be < 6 without both of them being between 5 and 6 or both being less than 5. For example, X1 might be 4 and X2 might be 5.5.
The 48h maximum would be the maximum of X1 and X2, so for this to be between 5 and 6 is not the same as saying that X1 and X2 are both between 5 and 6. The event [48h max <= 6m] is exactly [X1 <= 6 and X2 <= 6], and subtracting P[48h max <= 5m] = P[X1 <= 5 and X2 <= 5] removes precisely the cases where the maximum is 5m or less, leaving P[5m < 48h max <= 6m].
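A quick simulation makes the distinction concrete; the daily maxima here are drawn from an assumed Uniform(0, 10) distribution, chosen purely for illustration:

```python
import random

random.seed(1)

trials = 200_000
count_max_between = 0   # 48h max falls in (5, 6]
count_both_between = 0  # both daily maxima fall in (5, 6]

for _ in range(trials):
    # Two independent daily maxima from the assumed Uniform(0, 10) model.
    x1 = random.uniform(0.0, 10.0)
    x2 = random.uniform(0.0, 10.0)
    if 5.0 < max(x1, x2) <= 6.0:
        count_max_between += 1
    if 5.0 < x1 <= 6.0 and 5.0 < x2 <= 6.0:
        count_both_between += 1

p_max = count_max_between / trials
p_both = count_both_between / trials

# Exact values for this model: F(6)^2 - F(5)^2 = 0.36 - 0.25 = 0.11 for the
# max, while "both between 5 and 6" has probability only 0.1^2 = 0.01.
print(p_max, p_both)
```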