# Range of Difference: Bounds for Length of Stay


#### WWGD

TL;DR Summary
Want to know whether the length of a hotel stay, in days, satisfies a given bound
Ok, so I'm given hotel data: {Arrival Date, Departure Date}, each expressed as the nth day of the year, and I want to estimate whether the range/difference, i.e. the length of stay, is below a bound, say a week (7 days) for definiteness.

I'm thinking of defining the auxiliary variable Difference in Dates, D := Departure Date - Arrival Date, and then using order statistics: either the distribution of the range ##D_{Max} - D_{Min}##, or maybe just the distribution of the maximum ##D_{Max}##.
Is this a good way?
Thing is, I don't know the distribution of either Arrival Date or Departure Date, so I don't see how to compute the distribution of any of these three: Max Departure, Min Arrival, Max Departure - Min Arrival, in order to compute the order statistics.
Maybe @StephenTashi can comment?

and use the distribution of the range ##D_{Max}- D_{Min}##

I don't understand the data. Is there a possibly different ##D_{max}## for each person who stayed at the hotel? Do most people in the data have more than one stay?

I don't understand the data.
Me either. If you are interested in the length of stay then surely it is trivial to compute it for each stay and do stats on that directly? Are you interested in some mathematical equivalence or actually computing values, and if the latter how is the data stored and what language are you using to analyse it?

Yes, this is what I meant to do, but I was on my phone, which was dying on me. I wanted to define order statistics on the length of stay = Departure Date - Arrival Date. Each date described as nth day of the year.

Length of stay is a discrete random variable, taking values in the Natural Numbers and ##\{0\}##. As such, we can compute its deciles, including median, etc., unless I'm missing something. Or maybe there's some other statistic to evaluate claims about its range.
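
As a rough illustration of the deciles and median mentioned above, here is a minimal Python sketch. The `stays` list is made-up data (not from the thread), and the `inclusive` quantile method is just one of several reasonable conventions.

```python
from statistics import median, quantiles

# Hypothetical lengths of stay in days (illustrative data only).
stays = [1, 2, 2, 3, 3, 3, 4, 5, 5, 6, 7, 7, 8, 10, 14]

# quantiles(..., n=10) returns the 9 cut points that split the sample into deciles.
deciles = quantiles(stays, n=10, method="inclusive")
sample_median = median(stays)

print("deciles:", deciles)
print("median:", sample_median)
```

These are sample statistics: they describe this particular data set, and say something about the population only under further assumptions.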

Length of stay is a discrete random variable,
I would rather say length of stay can be modeled by a discrete random variable.

taking values in the Natural Numbers and ##\{0\}##.
Unless this is the kind of hotel that rents rooms by the hour, I think the range is strictly positive.

As such, we can compute its deciles, including median, etc., unless I'm missing something.
Yes of course, I'm still not seeing where the difficulty lies?

Each date described as nth day of the year.
That will cause problems over year ends; I would be inclined to convert to some other format, such as a POSIX timestamp, so you have ## \text{days} = \left\lfloor \frac{\text{end} - \text{start}}{86{,}400} \right\rfloor ##.
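
To make the year-end problem concrete, here is a small Python sketch. The dates and the helper names `nights_from_dates` and `nights_from_timestamps` are assumptions for illustration; the timestamps are taken at midnight UTC so that the floor in the formula above is exact.

```python
from datetime import date, datetime, timezone

SECONDS_PER_DAY = 86_400

def nights_from_dates(arrive: date, depart: date) -> int:
    """Length of stay from calendar dates; safe across year ends."""
    return (depart - arrive).days

def nights_from_timestamps(start: float, end: float) -> int:
    """Length of stay from POSIX timestamps: floor((end - start) / 86400)."""
    return int((end - start) // SECONDS_PER_DAY)

# Hypothetical stay that spans a year end.
arrive, depart = date(2023, 12, 30), date(2024, 1, 2)

# Naive day-of-year subtraction goes wrong: day 2 minus day 364.
naive = depart.timetuple().tm_yday - arrive.timetuple().tm_yday

nights = nights_from_dates(arrive, depart)

start = datetime(2023, 12, 30, tzinfo=timezone.utc).timestamp()
end = datetime(2024, 1, 2, tzinfo=timezone.utc).timestamp()
nights_ts = nights_from_timestamps(start, end)

print(naive, nights, nights_ts)
```

Working with full dates (or timestamps) rather than day-of-year numbers avoids having to special-case stays that cross 31 December.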

But in order to compute the order statistics, I need to know the distribution of the length of stay. How do I do that? Maybe a bootstrap? Sorry if my question is too simple. I'm not too familiar with this topic.

I need to know the distribution of the length of stay. How do I do that?
By constructing the sample ## \{ l_i \} = \{ \text{depart}_i - \text{arrive}_i \} ##; its empirical distribution is your estimate of the distribution of the length of stay.
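
A minimal Python sketch of that construction follows. The `records` list of (arrive, depart) day-of-year pairs is invented for illustration and assumes every stay falls within one calendar year.

```python
from collections import Counter

# Hypothetical (arrive, depart) pairs, each the nth day of the same year.
records = [(10, 12), (15, 22), (40, 41), (100, 110), (200, 203), (250, 257)]

# l_i = depart_i - arrive_i for each stay.
lengths = [depart - arrive for arrive, depart in records]

# Empirical probability mass function: relative frequency of each length.
n = len(lengths)
pmf = {length: count / n for length, count in sorted(Counter(lengths).items())}

# Empirical estimate of P(length of stay <= 7 days), the bound from the opening post.
bound = 7
fraction_within_bound = sum(1 for length in lengths if length <= bound) / n

print(pmf)
print(fraction_within_bound)
```

The observed fraction answers the question for the sample; how far it generalises depends on how the sample was drawn.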

I wanted to define order statistics on the length of stay = Departure Date - Arrival Date.

More vocabulary issues: For a specific sample of data, "order statistics" is already a defined term - just like "sample mean" is already a defined term. An order statistic for a specific set of data is a constant. Considering it as a formula for computing that number, an order statistic is a random variable.

The same applies to terms like "quartiles" except that one might also apply such a term to a probability distribution instead of a sample. If you think of "quartiles" applying to a probability distribution then they are population parameters instead of sample statistics.

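For a concrete example, if the source data is a SQL table we might have
SQL:

    -- Summarise length of stay within each decile.
    -- Assumes MySQL-style DATEDIFF(end, start), which returns whole days.
    SELECT
        decile
      , MIN(stay) AS min_stay
      , MAX(stay) AS max_stay
      , AVG(stay) AS mean_stay
    FROM (
        SELECT
            DATEDIFF(depart, arrive) AS stay
          , NTILE(10) OVER (
                ORDER BY DATEDIFF(depart, arrive)
            ) AS decile
        FROM
            stays
    ) AS stay_deciles
    GROUP BY
        decile
    ORDER BY
        decile
    ;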

Thanks. But how do I use this for a test on a given length of stay, or to construct a confidence interval of some sort? Say the claim is made that average length of stay is 5 days. What statistic do I compute, and what is its distribution? What is an estimator for the population range?
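
One distribution-free option, picking up the bootstrap idea raised earlier in the thread, is a percentile bootstrap interval for the mean. The sketch below is only that: it assumes the observed stays are an independent, representative sample, the `stays` data are invented, and `bootstrap_mean_ci` is a hypothetical helper rather than a library function.

```python
import random
from statistics import mean

def bootstrap_mean_ci(sample, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the population mean."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(sample)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(sample, k=n)) for _ in range(n_resamples))
    alpha = (1.0 - level) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[int((1.0 - alpha) * n_resamples) - 1]
    return lower, upper

# Hypothetical lengths of stay in days (illustrative data only).
stays = [1, 2, 2, 3, 3, 3, 4, 5, 5, 6, 7, 7, 8, 10, 14]

lower, upper = bootstrap_mean_ci(stays)
claimed_mean = 5
print(f"95% interval for the mean: ({lower:.2f}, {upper:.2f})")
print("claim consistent with interval:", lower <= claimed_mean <= upper)
```

If the claimed mean falls outside the interval, the sample is at odds with the claim at roughly that confidence level. The same resampling idea does not work well for the maximum or the range, since a resample can never exceed the largest observed value.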

what is its distribution?
That is indeed the question. You see we can't say anything about the relationship between a sample and the population unless we know how the data are distributed.
Say the claim is made that average length of stay is 5 days.
Well let's say the claim is made that length of stay is Poisson distributed with a mean value of 5 days. We could test this with a chi-squared test.
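
A sketch of that chi-squared goodness-of-fit computation in plain Python is given below. The binning (values 0 to 9 plus a pooled tail), the helper name `chi_squared_poisson` and the sample data are assumptions for illustration. Because the mean is specified by the hypothesis rather than estimated from the data, the degrees of freedom are the number of bins minus one.

```python
import math
from collections import Counter

def poisson_pmf(k, mean):
    """P(X = k) for a Poisson distribution with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def chi_squared_poisson(stays, mean=5.0, max_bin=10):
    """Return (statistic, degrees_of_freedom) for H0: stays ~ Poisson(mean).

    Values >= max_bin are pooled into a single tail bin.
    """
    n = len(stays)
    observed = Counter(min(s, max_bin) for s in stays)
    probs = [poisson_pmf(k, mean) for k in range(max_bin)]
    probs.append(1.0 - sum(probs))  # tail bin: P(X >= max_bin)
    statistic = sum(
        (observed[k] - n * p) ** 2 / (n * p) for k, p in enumerate(probs)
    )
    return statistic, len(probs) - 1

# Hypothetical sample: 100 stays of exactly 5 days, far too concentrated for a Poisson.
statistic, dof = chi_squared_poisson([5] * 100)

# 5% critical value of the chi-squared distribution with 10 degrees of freedom.
CRITICAL_5_PERCENT_DF10 = 18.307
print(statistic, dof, statistic > CRITICAL_5_PERCENT_DF10)
```

The usual rule of thumb is to pool bins until each expected count is at least about 5, which for small samples means using fewer bins than shown here.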

What is an estimator for the population range?
Again that depends on the distribution: many distributions including Poisson have no upper bound. On the other hand a linear distribution is bounded. But think about the implications of this: are you saying that by looking at a sample of some people that have stayed in some hotels during a certain period you want to draw a conclusion that nobody stays in any hotel ever for more than n days?

This situation has other dangers. Let's say you calculate that the average length of stay is 5 days: what does this actually tell you? Certainly not that people who stay for 5 days are your most important customers, for two reasons:
1. You may not have any customers who stay for 5 days - the mean could be made up of 100 stays of 1 night and 200 stays of 7 nights! This points towards the bigger problem:
2. Length of stay is probably not a useful statistic anyway; you probably want length of stay squared, i.e. each stay weighted by its own length, which is what hotels generally measure although in a slightly different form: they look at bed nights, or room nights. So in the above example we would see 100 bed nights on stays of 1 night and 1,400 bed nights on stays of 7 nights, so the bed-night-weighted average length of stay would be (100 x 1 + 1,400 x 7) / 1,500 = 6.6 nights.
So if you are looking for concise answers to come from means and variances you are going to have to know a lot more about the population distribution.
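
The arithmetic in the two points above can be checked with a few lines of Python; the variable names are illustrative only.

```python
# The example from the post: 100 stays of 1 night and 200 stays of 7 nights.
stays = [1] * 100 + [7] * 200

# Ordinary mean: every stay counts once.
simple_mean = sum(stays) / len(stays)  # 1,500 / 300

# Bed-night weighting: every stay counts once per night, so a stay of
# length l contributes l * l to the numerator and l to the denominator.
bed_nights = sum(stays)  # 100 + 1,400
weighted_mean = sum(length * length for length in stays) / bed_nights

print(simple_mean, bed_nights, weighted_mean)
```

Neither average corresponds to a stay that actually occurs in this sample, which is the point of the first item in the list.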
