Statistical theory/distribution for tuples/vectors of LLR values

1. Nov 3, 2012

jamie_m

I've been doing some cryptographic research in which I need to be able to assign "scores" to vectors of floating-point values. Each of these floating-point values is a log-likelihood ratio LLR(\hat{q}, p, q) where:

\hat{q} is empirical data. The empirical data for element i in the vector is not necessarily the same as that for element j in the vector.
p is a theoretical distribution.
q is the uniform distribution.

(We're dealing with Bernoulli-distributed data, so q can be represented as the 2-element array {0.5, 0.5})
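For Bernoulli data the LLR has a simple closed form: if n_0 of the N samples are 0, then LLR(\hat{q}, p, q) = n_0 log(p_0/q_0) + (N - n_0) log(p_1/q_1). A minimal Python sketch of this (function and variable names are mine, not from any library):

```python
import math

def bernoulli_llr(n0, n, p0, q0=0.5):
    """Log-likelihood ratio of the theoretical Bernoulli distribution p
    over the uniform distribution q, given n0 zeros in n trials."""
    n1 = n - n0
    return n0 * math.log(p0 / q0) + n1 * math.log((1.0 - p0) / (1.0 - q0))
```

So, for example, `bernoulli_llr(44, 50, 0.875)` is large and positive (the data fits p much better than the uniform), while `bernoulli_llr(25, 50, 0.875)` is negative.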

The idea is that the higher the LLR values in the vector are, the more likely it is that the correct key has been found, and the higher the score assigned to the vector should be.

Currently, I'm using the following scoring system:

Code (Text):

score = 0.0;
for (int i = 0; i < v.length(); i++)
{
    if (v[i] > 0)
    {
        add (v[i] squared) to score
    }
    else
    {
        subtract (v[i] squared) from score
        // Yes, the score can take a negative value.
    }
}

There are clearly other systems I could use; for example a straightforward summation of the values in the vector.
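For comparison, here is a runnable Python sketch of both candidate systems (reading the positive branch of the pseudocode above as adding the square; names are mine):

```python
def score_sum_of_squares(v):
    """Signed sum-of-squares variant: add the square of each positive
    LLR value, subtract the square of each negative one."""
    score = 0.0
    for x in v:
        score += x * x if x > 0 else -(x * x)
    return score

def score_sum(v):
    """The straightforward summation alternative."""
    return sum(v)
```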

For the sake of analysing this system, and in particular finding out how large a sample size is required for the correct key to have probability p_k of resulting in the largest score, I'd be interested in knowing if anyone knows anything about the statistical theory (if any!) governing vectors of LLR values, and in particular if anyone knows of a better/more typically used scoring system than the variant on Euclidean distance I've used here.

(Please provide evidence for its being "better" if you can!)

This would also apply to situations where the correct key need not result in the single highest score, just one of the X highest for some pre-decided X.

An earlier version of this work used vectors of Kullback-Leibler distances (between \hat{q} and p) instead. This didn't prove to be as effective, but if anyone can supply any information, similar to the above-requested, for this scenario I'd be very grateful.

Many thanks,

James McLaughlin.

2. Nov 3, 2012

Stephen Tashi

I suggest that you make another attempt at defining the problem.

Make it clear whether an individual random variable in the vector of random variables is a Bernoulli random variable or not. Explain whether there is any dependence between the random variables in a vector. (It might be best to explain what the random variables represent. For example, are they counts of some feature of a text message? Is each sample vector computed from a text message of the same size?)

Is the "theoretical" distribution actually correct? For example, there are theoretical distributions for character frequencies and n-character combinations in English text. However, these features are not independent of each other. I don't know if there is a theoretical joint distribution for them.

That's a clear question, but I'm not aware of any broad and general theory. If we are dealing only with Bernoulli random variables, the question can be made more precise. However, you have to specify "number of samples" precisely: a Bernoulli random variable is not the same as a binomial random variable. If each of the samples consists of a vector of realizations of Bernoulli random variables, then it is a vector of 0's and 1's. Is that the situation? If not, then perhaps the vectors are counts of some feature. If we treat such a count as a binomial random variable, then there is the question of how many "trials" were involved to get the count.

3. Nov 4, 2012

jamie_m

> I suggest that you make another attempt at defining the problem.

I'll try to do so; and to go into a bit more detail in the process. This is going to be quite long and I'm not sure I can condense it into a tl;dr, plus this will involve some Binomial random variables which are used to derive empirical Bernoulli distributions.

Firstly, I referred to the research as "cryptographic", so I should say something about at least one of the ciphers it applies to. The most straightforward one to use in this description would be the one described in

http://www.engr.mun.ca/~howard/PAPERS/ldc_tutorial.pdf

(This cipher is intended to be used in experiments and in teaching people about cryptography, but isn't strong enough to be used in any real-world applications. It's not the only one this research is relevant to, but it is the simplest.)

The cipher key is eighty bits in length, and is divided into five separate sixteen bit "round keys". A function RF() is applied four times, whereby

RF(16-bit integer i, integer j between 1 and 4) = g(bitwise xor of i and the jth round key) for some function g.

The cipher C is then defined as C(i, key) = bitwise xor of RF(RF(RF(RF(i, 1), 2), 3), 4) and the 5th round key.

One of the features of the function g is that, given some (but not all) of g's output bits, it's possible to use the round key to partially compute g's inverse and obtain some of its input bits. This is because g's output is the concatenation of four separate functions, each taking four of g's input bits as their own input.
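To make this structure concrete, here is a minimal Python sketch of the cipher skeleton as described above. The 4-bit S-box table is illustrative only (see the linked paper for the actual cipher, which also has details omitted here); the point is just that g applies a 4-bit function to each nibble independently:

```python
# Illustrative 4-bit S-box; the real cipher uses a specific table
# from the tutorial paper.
SBOX = [0xE, 0x4, 0xD, 0x1, 0x2, 0xF, 0xB, 0x8,
        0x3, 0xA, 0x6, 0xC, 0x5, 0x9, 0x0, 0x7]

def g(x):
    """Apply the 4-bit function to each of the four nibbles of a
    16-bit word; this is the 'concatenation of four separate functions'."""
    out = 0
    for shift in (0, 4, 8, 12):
        out |= SBOX[(x >> shift) & 0xF] << shift
    return out

def encrypt(p, round_keys):
    """round_keys: five 16-bit integers. Four keyed rounds RF, then a
    final xor with the 5th round key, as in the description above."""
    c = p
    for i in range(4):
        c = g(c ^ round_keys[i])       # RF(c, i+1)
    return c ^ round_keys[4]
```

Because each nibble of g's output depends on only one nibble of its input, knowing four output bits and the relevant round-key bits is enough to invert that nibble, which is what makes the partial decryption below possible.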

Now, this is what I'm trying to do:

Assume that we have obtained N pairs (plaintext P_i, encryption C_i of P_i under key K), and we'd like to try to deduce some of the bits of K from them. Call them (P, C)-pairs.

The set of bits we are attempting to deduce is a subset of the set of bits in the fifth round key. Let the size of this subset be denoted l_1. Furthermore, we will also be working with - and may be able to obtain some partial information on - a subset of the bits in the fourth round key. Let the number of such fourth-round key bits be denoted l_2.

Define a Boolean function FB, taking several bits as input, and outputting one bit.
More precisely, let FB take an input (n_1 + n_2) bits in length. Let FC be a function mapping the first n_1 bits to some one-bit value x and FD be a function mapping the remaining n_2 bits to some one-bit value y. Then FB is the value (x xor y).

Define a two-dimensional array - call it "counters". counters[k_a][k_b] is defined as:

#((P, C)-pairs) such that, for the (l_1)-bit value k_a and the (l_2)-bit value k_b, FB(a particular set of n_1 bits in the plaintext, another set of n_2 bits obtained by xoring part of the ciphertext with k_a and using k_b to compute some of the inverse of RF) = 0.

Clearly the values in counters are Binomially distributed.

So, for each value of (k_a, k_b), and for each ciphertext C_i, we have computed part of RF^{-1}((part of C_i) ^ k_a) and counted how many times this was equal to a second function computed on the plaintext bits P_i.

Now, for the wrong values of k_a, we expect the values in counters[][] to be approximately N/2 (i.e. uniformly distributed). For the correct guesses, we expect the distribution to be different. We need to have analysed the cipher in some way to get information on the distribution, but we certainly don't expect it to be uniform.

For example, let l_2 be 4 and let the predicted distribution (as you can see, it is in fact a sequence of Bernoulli distributions) be:

Pr(FB(plaintext bits, bits obtained from partial decryption with correct k_a and k_b) = 0) = 0.875
Pr(FB(plaintext bits, bits obtained from partial decryption with correct k_a and k_b) = 1) = 1-0.875

Pr(FB(plaintext bits, bits obtained from partial decryption with correct k_a and (correct k_b xor 0001)) = 0) = 0.375
Pr(FB(plaintext bits, bits obtained from partial decryption with correct k_a and (correct k_b xor 0001)) = 1) = 1-0.375

Pr(FB(plaintext bits, bits obtained from partial decryption with correct k_a and any other k_b) = 0) = 0.5
Pr(FB(plaintext bits, bits obtained from partial decryption with correct k_a and any other k_b) = 1) = 0.5

(Now, I'm lucky enough to be able to do experiments in a situation where the random variables corresponding to each of those Bernoulli distributions are independent of each other. But this won't always be the case. I'm hoping to be able to get some theoretical results for a situation in which they're independent; and then maybe to be able to generalise to at least some situations where they're not.)

This yields the following expectations for the counter values - which as I've said are binomially, not Bernoulli, distributed:

counters[correct k_a][correct k_b] \approx (7/8)*N
counters[correct k_a][correct k_b xor 0001] \approx (3/8)*N
counters[correct k_a][all other values] \approx N/2.
counters[incorrect k_a][any value] \approx N/2.

(I say \approx because the values are very unlikely to be precisely equal to their theoretical expectations.)
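Under the first predicted distribution, and under the independence assumption stated above, the counter values for the correct k_a can be simulated as independent binomial draws. A sketch (names mine):

```python
import random

def simulate_counters(n_pairs, correct_kb, l2=4, seed=0):
    """Simulated counters[correct k_a][k_b] under the first predicted
    distribution: Pr(0) = 0.875 at the correct k_b, 0.375 at
    correct k_b xor 0001, and 0.5 everywhere else."""
    rng = random.Random(seed)
    counters = []
    for kb in range(2 ** l2):
        if kb == correct_kb:
            p0 = 0.875
        elif kb == correct_kb ^ 0b0001:
            p0 = 0.375
        else:
            p0 = 0.5
        # One Bernoulli trial per (P, C)-pair, counting the zeros.
        counters.append(sum(1 for _ in range(n_pairs) if rng.random() < p0))
    return counters
```

With N = 2000 and correct k_b = 13, for example, the count at index 13 clusters near 1750 and the count at index 12 near 750, while the rest stay near 1000.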

Or it may be that the distribution will be either that or

Pr(FB(plaintext bits, bits obtained from partial decryption with correct k_a and k_b) = 0) = (1 - 0.875) = 0.125
Pr(FB(plaintext bits, bits obtained from partial decryption with correct k_a and (correct k_b xor 0001)) = 0) = (1 - 0.375) = 0.625
Pr(FB(plaintext bits, bits obtained from partial decryption with correct k_a and any other k_b) = 0) = (1 - 0.5) = 0.5

with corresponding expectations:

counters[correct k_a][correct k_b] \approx (1/8)*N
counters[correct k_a][correct k_b xor 0001] \approx (5/8)*N
counters[correct k_a][all other values] \approx N/2.
counters[incorrect k_a][any value] \approx N/2.

We have no a priori way of predicting which of these two distributions (or "these two sets of Bernoulli distributions") is correct. Unfortunately this two-distribution situation always applies, where each probability value predicted by the first distribution is 1.0 minus the value predicted by the second.

At this stage, let counters_2 be a two-dimensional array of floating point values such that counters_2[k_a][k_b] = counters[k_a][k_b]/N. We compute this as soon as we've processed all N (P, C)-pairs and computed all the values in counters.

The theoretical expectations as described above are now:

counters_2[correct k_a][correct k_b] \approx 0.875.
counters_2[correct k_a][correct k_b xor 0001] \approx 0.375.
counters_2[correct k_a][all other values] \approx 0.5.
counters_2[incorrect k_a][any value] \approx 0.5.

and

counters_2[correct k_a][correct k_b] \approx 0.125.
counters_2[correct k_a][correct k_b xor 0001] \approx 0.625.
counters_2[correct k_a][all other values] \approx 0.5.
counters_2[incorrect k_a][any value] \approx 0.5.

- i.e. the theoretical probability that each Bernoulli trial (or computation of FB) would result in the value 0.

Now, let's define a new array of floating-point values, Scores, with l_1 elements.
Let's also define two 2D arrays of floating point values, LLRs_with_first_theoretical[][] and LLRs_with_second_theoretical[][].

Code (Text):

for (each possible value of k_a in turn)
{
    Scores[k_a] = 0
    Reset the values in LLRs_with_first_theoretical[][] and LLRs_with_second_theoretical[][].

    for (each possible value of k_b in turn)
    {
        Proceed as if this was the correct k_b. Call it candidate_k_b.

        for (each possible l_2-bit value l in turn)
        {
            Let \hat{q} denote (empirical Bernoulli distribution in which Pr(0) is the value
            counters_2[current k_a][candidate_k_b xor l])
            Let p denote the first theoretical Bernoulli distribution for the value
            FB(plaintext bits, bits obtained from partial decryption with correct k_a and (correct k_b xor l))
            Let q denote the uniform Bernoulli distribution (Pr(0)=0.5, Pr(1)=0.5)

            Calculate the log-likelihood ratio, LLR(\hat{q}, p, q).
            Store it in LLRs_with_first_theoretical[candidate_k_b][l]

            Do the same with the second theoretical distribution to compute
            LLRs_with_second_theoretical[candidate_k_b][l]
        }

        /*
        Now, here's the tricky bit.
        The values in the vectors "LLRs_with_first_theoretical" and "LLRs_with_second_theoretical"
        need to be assigned scores. These scores should reward high values. There are various
        possible approaches, such as summing the vector values, but it's not clear which is the
        best, or the best-understood theoretically.
        */

        if (the highest of these two scores is higher than Scores[k_a])
        {
            let it be the new value of Scores[k_a]
        }
    }
}
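The loop above can be condensed into a runnable Python sketch, using summation as the vector score since that is one of the candidates under discussion (all names mine):

```python
import math

def llr_vector(counters, n, kb_candidate, theory):
    """theory[l] is the predicted Pr(0) for offset l from the correct k_b.
    Returns the vector of per-offset LLRs versus the uniform."""
    v = []
    for l, p0 in enumerate(theory):
        n0 = counters[kb_candidate ^ l]
        v.append(n0 * math.log(p0 / 0.5)
                 + (n - n0) * math.log((1.0 - p0) / 0.5))
    return v

def score_ka(counters, n, theory1, theory2):
    """Best summed-LLR score over all candidate k_b, trying both
    theoretical distributions, for one row counters = counters[k_a]."""
    best = float("-inf")
    for kb in range(len(counters)):
        for theory in (theory1, theory2):
            best = max(best, sum(llr_vector(counters, n, kb, theory)))
    return best
```

A row of counters matching one of the theoretical distributions scores well above zero, while a row of roughly uniform counts (the wrong-k_a case) scores below it.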

/*
And here's the second tricky bit.

We want Scores[correct k_a] to be the highest value in Scores.
What value of N do we need for this to be the case with, for example, probability 0.85? 0.95? 0.785?

We can also accept a situation where it is one of the X highest, for some value X.
Can we work out a formula in terms of X and the desired probability of success for the value of N required now?
*/

And, hopefully, there you have it!

4. Nov 5, 2012

Stephen Tashi

Is l_1 going to denote the size of set? Or is it a set? It isn't clear to me what an "(l_1)-bit value" is.

Is "k_a" the value of the bit (i.e. 0 or 1) or is it the index of the bit position in the key? (e.g. the 3rd bit in the 5th round key)

5. Nov 7, 2012

jamie_m

It's the size of the set - that is, it's the number of bits in it. So an (l_1)-bit value is a positive integer that can be expressed as a string of l_1 bits.

(So, for instance, the 8-bit values would be the integers 0 to 255.)

k_a is an l_{1}-bit integer value thus expressed. So if l_1 = 8, and the value of the eight relevant bits is 11010111, k_a = 215.

6. Nov 7, 2012

Stephen Tashi

I haven't read the whole thing, but linear and differential cryptanalysis is a very interesting concept. That's the first time I've heard of it.

I'll attempt some editorial work on your description until I reach the point of total confusion.

The cipher key K is a string of eighty bits. Five separate "round keys" are formed from K.
Bits K[1] through K[16] form round_key[1], the "first round key". Bits K[17] through K[32] form the "second round key", round_key[2], etc.

Let g(s) be a boolean function of the 16 bit string s.

Define the function RF(t,i), which outputs a 16 bit string as a function of the 16 bit input string t and the integer i, i = 1,2,3,4, as follows: RF(t,i) = g( bitwise xor of t with round_key[i] )

To encrypt a 16 bit string of plain text p into a 16 bit string of cypher text C the following algorithm is used:

c = p
for i = 1 to 4
{
c = RF(c,i)
}
c = bitwise xor of c with round_key[5]

(I don't know yet whether the remarks about g are significant to the problem or just supplementary information).

Assume that we have obtained N pairs of the form (plaintext P_i, encryption C_i of P_i) where P_i is a 16 bit string of plain text and C_i is the encryption of P_i using the algorithm defined above with the key K and the function g(), also defined above.

A big problem in understanding the arrays of counts that you subsequently define is keeping straight, in one's mind, which aspects of the situation are "fixed" and which vary with the index of the array. My guess at editing is:

Let L1 be a given subsequence of bits of round_key[5]. Let the number of bits in L1 be l_1.
Let L2 be a given subsequence of bits of round_key[4]. Let the number of bits in L2 be l_2.

Will n_1, n_2 and FB be "given" (specific and fixed) for the rest of the discussion?

From your previous post, I know you want to let the indices k_a and k_b loop through all possible bit strings of lengths l_1 and l_2 respectively. The rest of what's in the above paragraph is unclear. For example, I don't know whether "a particular set of n_1 bits in the plain text" defines a particular set of indices in the string P_i that is kept constant for all counters, or whether the indices that define the n_1 bits vary as k_a and k_b vary.

To have a distribution, what is being randomly sampled? I think we get only one count per (P,C)-pair. Is that correct? If so, the random variable is a function of "a randomly selected (P,C)-pair".

I'll work on this more later. I'll post this now so I won't lose the effort to some computer glitch.

7. Nov 8, 2012

Stephen Tashi

For now, I'll just assume there is some boolean function W(C_i, P_i, k_a, k_b) which is a function of a (P,C)-pair and two bit strings k_a and k_b. We should name the array of data. I'll say the array A[k_a][k_b] is the number of pairs for which W(C_i,P_i,k_a,k_b) = 0. Since there are N pairs, we can deduce the count for when W(C_i,P_i,k_a,k_b) = 1.

I can understand that example as estimates of probabilities based on several elements of the Array A[][]. However, the example where the argument is "correct k_b xor 0001" is slightly confusing. If this is just a way of referring to one element in the array A[][], I understand it. If it refers to more than one element in the array, I don't.

Let's be specific about what the random variables are and what the population is. My guess is that the population is "all possible sets of (P,C)-pairs" that might occur (not just the ones in your data). For fixed i, j, the value of B[i][j] = W(P,C,i,j) is a random variable that depends on the random selection of (P,C). If we randomly select N (P,C)-pairs then the count A[i][j] is also a random variable.

I can believe, for example, that B[2][4] and B[3][1] are independent random variables if we make an independent draw of a (P,C)-pair for each. But if we are evaluating them on the same (P,C)-pair and making only one random draw, then this is less plausible. A similar situation holds for the counts as random variables.

This next example (or part of an example?) is completely confusing. What is meant by "it may be"? Are you illustrating counts from one particular data set of (P,C)-pairs in the example above and illustrating counts from a different set of (P,C)-pairs in the example below?

I'm missing some fundamental concept here. Why have we reached a choice between two distributions that are related in this way? Is it wrong to assume that your array of counts counts how many times W(P,C,i,j) = 0 for a given i, j?

8. Nov 8, 2012

jamie_m

Thanks for the detailed reply - I'll need to study it in more depth tomorrow to reply properly. But I think I may be able to shed some light on one thing you said:

"Why have we reached a choice between two distributions that are related in this way?"

I'm afraid that it's due to the inner workings of the cipher - which of the two distributions is the correct one depends on the parity of a particular subset of the bits in the first three round keys. Since we don't know what these bits are, we can't compute their parity, and so we don't know in advance whether the distribution corresponding to parity 0, or the distribution corresponding to parity 1, is correct.

9. Nov 9, 2012

jamie_m

They are significant - I have to do this "partial computation of g's inverse", and I need to explain how I can obtain information on g^{-1}(y) without knowing all the bits of y.

This is correct.

Yes.

The first definition - a particular set of indices in P_i that never varies - is the correct one.

Yes, I'd say that was a good description.

It's just a way of referring to one element in the array. Let me try to clarify. Assume E[X] is the expectation of random variable X:

E[A[correct k_a][correct k_b]] = 0.875 * N
E[A[correct k_a][correct k_b xor 0001]] = 0.375 * N

So if the correct value of k_b was 13 in decimal (1101 in binary), then "correct k_b xor 0001" is 1100 in binary (12 in decimal) and we have:

E[A[correct k_a][13]] = 0.875 * N
E[A[correct k_a][12]] = 0.375 * N

This is correct.

That's a good point, since we are indeed "evaluating them on the same (P, C)-pair and making only one random draw". I'll have to admit that I don't know how to quantify the level of statistical dependence that could result from this, or indeed its effect on any LLR-or-otherwise-based scoring system for the values in the vectors LLRs_with_first_theoretical and LLRs_with_second_theoretical.

Now, at this point, we come to the two-related-distributions situation. I've explained why there are two such distributions in my comment above, and I want to make it clear that the two distributions are for the same set of (P, C)-pairs. The reason I didn't at first mention the two-distribution situation was that I thought it would overcomplicate the description if I tried to deal with it too early on, and so I mentioned it near to the end of my explanation.

Clearly this did in fact serve to obfuscate, not clarify, the situation, which wasn't my intention. Apologies for that.

10. Nov 9, 2012

Stephen Tashi

That clears up many minor points, but I still don't understand the fundamental question: What are the random variables that are involved in this problem? How exactly are they defined?

What I understand so far is that one random variable is a random choice of a (P,C)-pair. We hold various parameters constant and apply some process whose inputs are these parameters and the (P,C)-pair. I think the process produces a 0 or 1, so the process is a random variable by virtue of being a function of the (P,C)-pair, which is a random variable. When we take a sample of N independently chosen (P,C)-pairs, the output of the process gives a total number of 0's and a total number of 1's. However, apparently the random variable of interest is not the number of 1's produced by the process, and it is not the number of 0's produced by the process. It is somehow ambiguous which count we are interested in. But you haven't defined any probability associated with this ambiguity. So you haven't defined the bottom-line random variable.

The data is an array A[i][j] of "counts". I'm assuming A[i][j] is the count of how many times the process produced a 0. If i and j are given, do you know whether the random variable of interest is the count of 0's or the count of 1's for those particular i, j? From what you've said so far, I'm not sure. Your original post talked about using log likelihoods. But if you used log likelihoods, you had to have a specific distribution, didn't you?

11. Nov 12, 2012

jamie_m

I think I can clarify further and hopefully eliminate the remaining ambiguity:

This is the process being applied to a single (P, C)-pair:

Code (Text):

We hold various parameters constant, such as the positions of the bits to which the
process is applied, as stated.

for (each possible candidate value of k_a in turn)
{
    Apply a process with inputs (current candidate value of k_a, P/C pair).
    The output from this will be a string of bits.

    for (each possible value of k_b in turn)
    {
        Apply another process, with inputs
        (current candidate value of k_b, the string of bits mentioned above, P/C pair).
        The output from this will be 1 bit in length.
        If the output is 0, add 1 to counters[k_a][k_b]
    }
}

This is repeated for all the (P, C)-pairs.

Now, for the correct k_a, we have two sets of theoretical distributions. Each set contains a theoretical distribution for the value of counters[correct k_a][k_b] for all valid values of k_b. So you could say that the number of 0s is the random variable as far as each of these distributions is concerned - though since the number of 1s is in all cases equal to N minus the number of 0s, it contains no more information than the number of 0s.

Code (Text):

(Note that counters[i][j] is A[i][j] as referred to in your previous posting.)

for (each candidate k_a in turn)
{
    for (each candidate k_b in turn)
    {
        Build up a vector v_1, where v_1[i] is the LLR indicating how closely the value of
        counters[k_a][i \oplus k_b] matches the theoretical distribution for
        counters[k_a][i \oplus correct k_b] from the first set, versus how closely it matches
        the uniform.

        (v_2 is defined and computed likewise in terms of the second set.)

        Each of v_1 and v_2 needs to be assigned a score which rewards high LLR values
        (since high LLR values indicate an empirical distribution close to that predicted,
        and far from the uniform distribution that is expected for incorrect k_a values).

        Now, in the experiments I've done since making the original post, a direct summation
        of the values in each vector has turned out to be more effective as a scoring system
        than the modified sum-of-squares I first described.

        Let the current value of the tuple (k_a, k_b) be assigned the highest of these
        scores as its own score.
    }
    k_a's score is the largest score assigned to any of the (k_a, k_b) pairs.
}

The higher k_a's score, the more likely it should be to be the correct value.

And, as stated in the original post, I'm looking for advice on choosing a scoring system, based on any sort of statistical theory related to vectors of LLR values, that will be of use in working out how high N will have to be for the correct value of k_a to have probability at least p of having the highest score - or even just one of the X highest for a given X.
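One crude empirical handle on the "how large must N be" question is a Monte Carlo estimate: repeatedly simulate the correct key's counters against a set of uniform competitors at a given N, score everything with the summed-LLR system, and measure how often the correct key comes out on top. A sketch under the independence assumption (all names mine; the "top X" variant just replaces the max comparison with a rank check):

```python
import math
import random

def draw_counters(n, probs, rng):
    """One simulated experiment: a binomial count of 0's per position."""
    return [sum(1 for _ in range(n) if rng.random() < p) for p in probs]

def llr_score(counters, n, theory):
    """Summed LLR of the counters against `theory`, versus the uniform."""
    return sum(n0 * math.log(p / 0.5) + (n - n0) * math.log((1.0 - p) / 0.5)
               for n0, p in zip(counters, theory))

def success_rate(n_pairs, theory, n_wrong=15, trials=100, seed=1):
    """Monte Carlo estimate of Pr(the correct key's score beats all
    n_wrong uniform competitors) at a given sample size n_pairs."""
    rng = random.Random(seed)
    uniform = [0.5] * len(theory)
    wins = 0
    for _ in range(trials):
        right = llr_score(draw_counters(n_pairs, theory, rng), n_pairs, theory)
        wrong = [llr_score(draw_counters(n_pairs, uniform, rng), n_pairs, theory)
                 for _ in range(n_wrong)]
        if right > max(wrong):
            wins += 1
    return wins / trials
```

Sweeping n_pairs upward until success_rate crosses the desired probability gives a numerical answer for a specific distribution, even where a closed-form result is unavailable.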

12. Nov 12, 2012

Stephen Tashi

I think I'm beginning to get it.

Ok, for a given k_a, we have two sets of distributions, not simply "two distributions".

I'm not sure why the word "theoretical" applies to both sets of distributions.

So, for example, assume we have 50 PC-pairs. If k_a = 11 is the correct key bit string and k_b varies over the strings 00, 01, 10, 11, then one set of distributions is a vector of "uniform" distributions of counts. We can summarize this vector of uniform distributions in various ways. We could write a vector of the expected number of 0's out of 50 PC-pairs, which would be (25,25,25,25). Or we could write the vector that gives the probability of a 0 in each distribution as (0.5,0.5,0.5,0.5).

The other vector of distributions either comes from something you can prove about the behavior of using the correct k_a or it comes from the vector of empirical counts,
(count[11][00],count[11][01],count[11][10],count[11][11]).

(I can see why it's convenient not to distinguish between a bit string and the value of the base 2 integer that it represents, but I am making a distinction. So I mean "10" to be a string that represents the number (1)(2^1) + (0)(2^0) = 2, not the number ten.)

13. Nov 13, 2012

jamie_m

No, this is not one of the two sets of theoretical distributions referred to. It could be viewed as the set of theoretical distributions for the *wrong* k_a, I suppose.

Moreover, as you note in your last post, the experimental data has enabled us to construct a set of empirical distributions - the main reason why I referred to sets of theoretical distributions was to distinguish them from the set of empirical distributions.

Not quite - from what we have been able to prove about the behaviour when the correct k_a is used, we have been able to construct two sets of theoretical distributions, but we do not know in advance which is the correct one. If we knew the values of certain bits from the round keys corresponding to earlier rounds, we would know which distribution was correct. But we don't.

If we express one of these two sets of theoretical distributions as a vector v such that v[i] is the probability of 0 in the ith distribution, indexed from 1 to m, then the other set would be expressible as a vector of the form:

((1.0 - v[1]), (1.0 - v[2]), ..., (1.0 - v[m]))

14. Nov 13, 2012

Stephen Tashi

For a given correct k_a, let's call the two possible alternative distributions f1 and f2. If f1 is the one that is correct for, say, k_b = 00, then is it also correct for the other values of k_b, such as 01, 10, 11? Or does which of f1 and f2 is correct vary as k_b changes?

15. Nov 14, 2012

jamie_m

I have a little trouble understanding this time. However, I can say that if f1 is the correct distribution, changing the value of k_b won't make f2 into the correct one.

If 00 is the correct value of k_b, and if you guess at, say, 01 as the correct value, this pretty much only means that your empirical data won't fit f1 as well as it would have if you'd guessed the correct value, 00.

But [which of f1 and f2 was correct] was already fixed before you started obtaining the (P, C)-pairs, and does not change.

16. Nov 14, 2012

Stephen Tashi

I think I get that, but let me make sure.
Suppose k_a = 00 is the correct key.
Suppose the (correct) probability that a randomly selected PC pair produces a 0 that adds one to the count A[00][00] is 0.8. Does this also mean that a randomly selected PC pair has probability 0.8 of adding one to the count A[00][01]?