# I Supremum inside and outside a probability

1. Mar 15, 2016

### Portella

I'm trying to deal with the supremum concept in a specific situation, but I think I'm getting the concept wrong.

A step of a proof I'm going through states:
$P\ [\sup\limits_{x}\ |f(x)\ -\ f'(x)|\ >\ y\ |\ z]\ \ \leq\ \ \sum_{i=1}^M\ P\ [\ |f(x)\ -\ f'(x)|\ >\ y\ |\ z]\ \ \leq\ \ M\times\ \sup\limits_{x}\ P\ [\ |f(x)\ -\ f'(x)|\ >\ y\ |\ z]$
But wouldn't $P\ [\sup\limits_{x}\ |f(x)\ -\ f'(x)|\ >\ y\ |\ z]$ and $\sup\limits_{x}\ P\ [\ |f(x)\ -\ f'(x)|\ >\ y\ |\ z]$ be the same? After all, wouldn't the same x that bears the greatest absolute value also bear the greatest probability in the second case? I just want to understand the need for the step described above and the consequent multiplication by M.

2. Mar 15, 2016

### andrewkirk

No, I'm afraid they wouldn't. Taking a supremum of probabilities, which are in the range [0,1], and taking the supremum of quantities $|f(x)\ -\ f'(x)|$, which for all we know could have any value at all, are completely different things.

But to give any more help, I'd need to understand your problem better. As written, I cannot make any sense of it, because you have not specified which of your variables are random variables, so it is impossible to interpret the meaning of the probability operator $P$. The usual convention is to write random variables with capitals and other items in lower case. Can you rewrite it to show which items are random variables, and give some more context?

3. Mar 15, 2016

### haruspex

More background needed. What is M? What does the index i represent, given that it only occurs in the range, not in the expression inside the sum? In what way is z a condition affecting f, etc.?
I assume f' is just another function, with no functional relationship to f.

4. Mar 16, 2016

### Portella

$P\ [\sup\limits_{X}\ |f(x)\ -\ f'(x)|\ >\ y]$ and $\sup\limits_{X}\ P\ [\ |f(x)\ -\ f'(x)|\ >\ y]$
I'm sorry. That's the first time I'm posting here and the first time I'm using Latex. I'm still getting used to it.
I've got rid of z and consider y a constant; f and f' are two different functions. I know that the supremum of probabilities is in the range [0,1] and that the supremum of the difference can assume any value, but I want to compare the two probabilities, which will both have the same range. This is the only point I need to understand: whether both probabilities would assume the same value, since it seems to me that both would be maximized by the x that bears the greatest absolute distance between the two functions.

5. Mar 16, 2016

### haruspex

Unfortunately you did not answer Andrew's question: where is the random variable in this? I would have guessed y, but you now say to treat y as constant. So are f and f' random functions?

6. Mar 17, 2016

### Portella

I used capital X under the supremum, but didn't know if I should do it inside the function, since the notation I'm used to is to use capital letters to represent the whole set only.
I will try to make it clearer:
$P\ [\sup\limits_{X}\ |f(X)\ -\ f'(X)|\ >\ 5]$ and $\sup\limits_{X}\ P\ [\ |f(X)\ -\ f'(X)|\ >\ 5]$
X is a random variable with domain [0, 100], for example, and f(X) and f'(X) are two different functions that can assume values between, let's say, 0 and 10.
I want to know the relation between the two probabilities, since it seems to me that in such a case the supremum of the absolute difference and the supremum of the probability would be obtained with the same x, i.e. the x in the domain that bears the greatest absolute difference.

7. Mar 17, 2016

### haruspex

Thanks for the extra info, but the reason it was not obvious is that it turns out not to make sense.
Once you take the sup over the range of X, there is no longer any random component, so taking the probability of it is meaningless. Likewise, on the other side of the equation, once you apply P there is no variable for the sup to act on.
Indeed, your previous notation, with f(x) rather than f(X) was correct, and betrays the lack of any r.v. for P to act on.

Where does this question come from?

8. Mar 21, 2016

### Portella

This is why I think I'm making some conceptual confusion. My source was a machine learning book which is used as source for an online course. I have changed some terms because it would be confusing to transcribe them literally and the proof is too long to reproduce here. I've tried to translate the part of it which wasn't clear for me, but I may have introduced some imprecision when doing it.
I could've uploaded the pages, but it would seem like an easy (lazy) solution. Usually we try to simplify the question and make it easier for those willing to help us, but I may have made it more difficult instead.
$P\ [\sup\limits_{H}\ |Ein(h)\ -\ Eout(h)|\ >\ \epsilon\ |\ S]$ and $\sup\limits_{H}\ P\ [\ |Ein(h)\ -\ Eout(h)|\ >\ \epsilon\ |\ S]$
In the book, H is a set of possible hypotheses (dichotomies), Ein is the error in the sample and Eout is the error out of sample, both functions of h; S is a specific data set. I don't know if that out-of-context info can help anyhow, but the only other option I can think of would be reproducing the whole proof here.

9. Mar 21, 2016

### haruspex

Ok, that is making more sense, but I'm not quite there.
I am not familiar with the error terminology you use, but it sounds like variance. I.e. you are comparing the variance in the sample with the variance in the whole population. Is that right?
If so, it all makes sense to me now except for the "| S". Having selected hypothesis h and sample set S, there is no random variable left for P: h fixes Eout and S fixes Ein.

10. Mar 23, 2016

### Portella

$P\ [\sup\limits_{H}\ |Ein(h)\ -\ E'in(h)|\ >\ \epsilon\ |\ S]$ and $\sup\limits_{H}\ P\ [\ |Ein(h)\ -\ E'in(h)|\ >\ \epsilon\ |\ S]$

I made an error when changing the expression.

Yes, it is a measure of error. The goal of the whole proof is to transform Hoeffding's inequality in order to get rid of the measure of error relative to the whole population when the set of hypotheses is infinite. At the end, we have a relation between errors in two samples taken from the same population. For that we take a single sample and divide it randomly in two to get the two samples (D and D') we want. S is one such partition and is given here. I'm sorry if the text isn't precise, but I'm doing my best to summarize it.

H is the set of possible hypotheses for this particular sample S. We are computing the probability that, for a given hypothesis, the error measured in one sample (D) will deviate from the error measured in the other sample (D') by more than epsilon. Here is where things become a little foggy for me. As I understand it, in the first expression we are computing the probability that the set of deviations (Ein(h) - E'in(h)), going through all h in H, will have a worst deviation greater than epsilon. In the second expression, as I understand it, we would go through the same set of deviations, i.e. (Ein - E'in) for all h in H, and see which of them produces the greatest probability of its corresponding disagreement exceeding epsilon. I think the uncertainty comes from the fact that we don't know the hypothesis set: given a certain set S, which is the division of a single specific sample into two other samples, what is the probability that we'll have a hypothesis set whose worst-deviating hypothesis produces a deviation greater than epsilon?

I'm sorry for the verbosity (and the possible lack of clarity), but it helps me think about the problem and the only other alternative would be to post the pages of the book.

11. Mar 23, 2016

### haruspex

You are not being consistent. Do you mean that S is the sample and the partition is into D and D'? If so, that makes sense at last, the P operator acting over the choice of partition.
Ok, but I asked if that means variance (or maybe standard deviation).
And to recap, you are trying to show that taking sup then P produces a number less than or equal to that from taking P then sup?

12. Mar 23, 2016

### andrewkirk

Based on your verbal description, I think the following would be a better representation of the two expressions:

$$P\left[\sup_{h\in H}\left|Error(X,S,1,h)-Error(X,S,2,h)\right|>\epsilon\ \middle|\ S\right]$$
and
$$\sup_{h\in H}P\left[\left|Error(X,S,1,h)-Error(X,S,2,h)\right|>\epsilon\ \middle|\ S\right]$$

The random variables for which we are measuring probability are
• $S$, whose value is a vector in $\{1,2\}^m$ where $m$ is the number of items in the set being partitioned by $S$. That is, $S$ specifies a randomly-chosen partition of the sample space into two parts.
• $X$, which is a vector of measurable values for all elements of the sample space.
$h$ is a particular hypothesis from the set $H$ of all possible hypotheses

$Error$ is a function whose value is a real number. $Error(X,S,k,h)$ is the error measured in the $k$th part of partition $S$ under hypothesis $h$, where the values of the elements of the sample space are as specified in $X$. The nature of the function is irrelevant to what is being considered here. All we need to know are its domain and range. When the inputs to the function are random variables (as is the case here), the function value becomes a random variable.

Given those definitions, we can re-write the third line of your OP as:

$$P\left[\sup_{h\in H}\left|Error(X,S,1,h)-Error(X,S,2,h)\right|>\epsilon\ \middle|\ S\right]\leq \sum_{h\in H}P\left[\left|Error(X,S,1,h)-Error(X,S,2,h)\right|>\epsilon\ \middle|\ S\right]\leq |H|\times \sup_{h\in H}P\left[\left|Error(X,S,1,h)-Error(X,S,2,h)\right|>\epsilon\ \middle|\ S\right]$$

This now seems to make sense and - on a quick look - the inequalities also seem to be valid.
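As a quick sanity check, here is a minimal Python sketch of why the chain of inequalities holds. The setup is hypothetical: partitions are assumed equally likely, and the failure events are represented as a 0/1 matrix rather than by computing any actual error function.

```python
import random

# Hypothetical setup: rows index equally likely partitions i, columns index
# hypotheses j; x[i][j] is 1 if the error difference for hypothesis j under
# partition i exceeds epsilon, and 0 otherwise.
random.seed(0)
n_part, n_hyp = 50, 8
x = [[random.randint(0, 1) for _ in range(n_hyp)] for _ in range(n_part)]

def p_col(j):
    """P[|error difference| > eps] for hypothesis j: fraction of partitions."""
    return sum(x[i][j] for i in range(n_part)) / n_part

# P[sup_h ... > eps]: fraction of partitions where SOME hypothesis fails.
p_of_sup = sum(max(row) for row in x) / n_part

# Union bound: sum over hypotheses of the individual probabilities.
union_sum = sum(p_col(j) for j in range(n_hyp))

# |H| times the largest individual probability.
m_times_sup = n_hyp * max(p_col(j) for j in range(n_hyp))

# The chain of inequalities holds term by term.
assert p_of_sup <= union_sum <= m_times_sup
```

The first inequality is just the union bound (the event "some hypothesis fails" is contained in the union of the individual failure events), and the second holds because a sum of $|H|$ terms is at most $|H|$ times its largest term.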

13. Mar 25, 2016

### haruspex

@Portella , in case Andrew's post has not got you over the line, we can simplify the question a lot.
Index all the possible partitions by i, and all the hypotheses by j. Instead of probabilities, we can use sums over all partitions. For any hypothesis and partition, let $x_{i,j}$ be 1 if the error difference exceeds epsilon and 0 otherwise.
So we want to compare $\sup_j \sum_i x_{i,j}$ with $\sum_i \sup_j x_{i,j}$.
Can you see which can never be less than the other?

14. Mar 30, 2016

### Portella

First of all, I would like to thank you, haruspex and Andrew, for your patience. I could follow Andrew's reformulation and, as I understand it, it seems precise. Well, understanding a precise description is easier than formulating one, and I know I wasn't able to do it well when trying to briefly describe my doubt. Anyway, it turns out that my problem, as stated in the first message, was really a conceptual one. I think I got it.

Consider $\sup_{j}\ \sum_{i}\ x_{i,j}$ compared with $\sum_{i}\ \sup_{j}\ x_{i,j}$, as you have put it, haruspex. I wasn't (or still am not) understanding the use of sup outside the sum (probability). Let's suppose, for instance, that we have 3 possible partitions of S and 3 hypotheses in H, with i being the index for partitions and j for hypotheses, as you defined. Let $x_{i,j}$ be the error difference, instead of an indicator variable; $x_{1,1} = 2$ means the error difference when using hypothesis 1 with partition 1 is equal to 2. Taking the matrix with indices i,j (columns are hypotheses, rows are partitions),
$\begin{pmatrix} 1 & 2 & 4\\ 3 & 2 & 1\\ 2 & 1 & 3 \end{pmatrix}$
we would have
$\sup_{j}\ \sum_{i}\ x_{i,j}$ = 4 + 1 + 3 = 8, after fixing the single hypothesis that maximizes the sum.
$\sum_{i}\ \sup_{j}\ x_{i,j}$ = 4 + 3 + 3 = 10, using the hypothesis which bear the greatest difference for each partition on its turn.
Is that it?
Then, using the indicator variables you established (i.e. for any hypothesis and partition, let $x_{i,j}$ be 1 if the error difference exceeds epsilon and 0 otherwise) and taking probabilities with $\epsilon = 2$, we would have
$\begin{pmatrix} 0 & 0 & 1\\ 1 & 0 & 0\\ 0 & 0 & 1 \end{pmatrix}$
and
$$P\left[\sup_{h\in H}\left|Error(X,S,1,h)-Error(X,S,2,h)\right|>\epsilon\ \middle|\ S\right]$$
would result in (1 + 1 + 1)/3 = 1,
whereas
$$\sup_{h\in H}P\left[\left|Error(X,S,1,h)-Error(X,S,2,h)\right|>\epsilon\ \middle|\ S\right]$$
would result in (1 + 0 + 1)/3 = 2/3.
Is this right?
I just want to be sure I'm processing the sup in sums and probabilities the right way.
That way, the inequalities would hold, as far as I can see, but I want to be sure I'm interpreting it the right way.
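The arithmetic above can be checked with a short Python sketch, using the same 3x3 matrix and the same $\epsilon = 2$ from this example:

```python
# The 3x3 matrix from the example: rows are partitions i, columns are hypotheses j.
x = [[1, 2, 4],
     [3, 2, 1],
     [2, 1, 3]]

# sup_j sum_i x[i][j]: fix the single hypothesis maximizing the sum over partitions.
sup_of_sums = max(sum(row[j] for row in x) for j in range(3))  # 4 + 1 + 3 = 8

# sum_i sup_j x[i][j]: pick the best hypothesis separately for each partition.
sum_of_sups = sum(max(row) for row in x)  # 4 + 3 + 3 = 10

# Indicator version with epsilon = 2, averaging over the 3 equally likely partitions.
eps = 2
ind = [[1 if v > eps else 0 for v in row] for row in x]
p_of_sup = sum(max(row) for row in ind) / 3                       # P[sup ... > eps]
sup_of_p = max(sum(row[j] for row in ind) for j in range(3)) / 3  # sup P[... > eps]

print(sup_of_sums, sum_of_sups)  # 8 10
print(p_of_sup, sup_of_p)        # 1.0 and 2/3
```

Taking the sup inside the sum (or probability) can only give a larger or equal value, which is the direction of the inequality in the original proof step.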

15. Mar 30, 2016

### haruspex

That is the way I see it.

16. Apr 1, 2016

### Portella

Ok. If that is right, now I have the tools to understand the proof. Thank you very much!