# Quantifying Significance In Re-ordering Of a Set

1. Jan 10, 2012

### fasterthanjoao

Hi folks,

I have what I think is quite a basic question, but I'm looking for options.

So, I have data that consists of a set of numbers (this is not a set theory question) - each number can be ascribed to one of two groups (the source of the number). I know the source of each number, but I want to set up a test case: try to group the numbers so that they separate into their two respective groups, pretending I don't know the true result when I start.

Now, my question isn't about the mechanics of re-ordering data or anything of the sort - what I want to do is somehow characterise the 'errors' that the algorithm I'm working with produces. This is perhaps best described by an example: say I have a set of X's and O's (these are actually numbers, but the X and O represents the source of each number - the two groups). The set is ordered arbitrarily: XOOXOXOOOXXOOXXX. I then re-order the set based on some things I know about the numbers and get, say: XXXXOXXOXXOOOOOO.

Then, the set is split into X's and O's with two errors (the O's that have ended up on the side of the X's). This is a pretty good result - and something that I want to quantify. I am thinking I could just do a hypergeometric probability test: split the entire set in half and test the probability that each half contains as many X's and O's as it does. The problem is that I would also like the 'distance' to be important. As in,

OXXXXXXXXOOOOOOO is a worse result than XXXXXXXOXOOOOOOO, because the O on the left has made its way to the far end of the other group. Maybe some rank correlation approach would capture this?
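Both ideas here - the half-split hypergeometric probability and a rank-based measure - can be sketched in a few lines. This is only an illustration on the toy 16-symbol strings above, assuming equal group sizes; the function names are mine, not from any library. The rank-based measure used is the Mann-Whitney U statistic (one concrete form of the rank-correlation idea): the number of (X, O) pairs in which the X appears before the O. An O pushed deep into the X region breaks many such pairs, so distance matters - unlike the half-split count.

```python
# Sketch: compare a half-split hypergeometric tail probability with the
# Mann-Whitney U pair count on the two example orderings.  Pure stdlib.
from math import comb

def hypergeom_tail(total, n_x, half, k_obs):
    """P(left half contains >= k_obs X's) under a uniformly random ordering."""
    return sum(comb(n_x, k) * comb(total - n_x, half - k)
               for k in range(k_obs, min(n_x, half) + 1)) / comb(total, half)

def u_statistic(ordering):
    """Count (X, O) pairs where the X appears to the left of the O."""
    pairs = o_to_right = 0
    for c in reversed(ordering):
        if c == "O":
            o_to_right += 1          # this O sits to the right of every X yet to come
        else:
            pairs += o_to_right      # this X precedes all O's counted so far
    return pairs

for s in ("OXXXXXXXXOOOOOOO", "XXXXXXXOXOOOOOOO"):
    n, n_x, half = len(s), s.count("X"), len(s) // 2
    k = s[:half].count("X")          # X's observed in the left half
    print(s, hypergeom_tail(n, n_x, half, k), u_statistic(s))
```

Note that the half-split probability comes out identical for both orderings (each left half holds 7 X's), while U prefers the second - exactly the distinction asked for. Dividing U by n_x * n_o gives a score in [0, 1] (the AUC), and a Wilcoxon rank-sum test on the positions would turn it into a p-value.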

I want to take care and avoid doing something silly!

thank you,
N

2. Jan 10, 2012

### Stephen Tashi

I think that you need to state more details of the problem in order to get a good answer. For example, it isn't clear from your description whether the algorithm you are evaluating sometimes gets the total number of X's wrong in addition to making errors about how the X's and O's are ordered.

It isn't possible to rationally discuss the statistics and probabilities involved in a real world problem unless you form a probability model for it - i.e. you form a clear enough picture of the situation so that you could write a computer simulation of it (even if you don't actually write such a simulation).

For example, in the case of a deterministic algorithm, how are probabilities involved in its output? We'd have to assume the probabilistic behavior is due to variations in its inputs - i.e. there are some inputs of numbers that it classifies better than others. Is the algorithm you are evaluating deterministic?

If you quantify the quality of the algorithm, what decisions will be made on the basis of that quantity? Can you assign a numerical cost to making a wrong decision? Will you be using the method to compare two different algorithms? In real world problems, people usually can't answer those sorts of questions completely, but it helps to discuss them if you want good advice on how to do the quantification.
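If costs can be assigned, the quantification becomes a weighted error count. A hypothetical sketch (the costs 1.0 and 5.0 are invented for illustration, not from the thread, and it assumes the set is simply split in half):

```python
# Hypothetical cost-weighted score for one half-split of an ordering.
# The per-error costs are made-up placeholders to show the idea.
def weighted_error(ordering, cost_o_on_x_side=1.0, cost_x_on_o_side=5.0):
    half = len(ordering) // 2
    return (ordering[:half].count("O") * cost_o_on_x_side    # O's misplaced left
            + ordering[half:].count("X") * cost_x_on_o_side) # X's misplaced right

print(weighted_error("XXXXOXXOXXOOOOOO"))   # 2 O's left, 2 X's right
```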

3. Jan 11, 2012

### fasterthanjoao

Great reply, thanks - I see what you mean, and clearly I need a much better understanding. Perhaps I need a statistics degree :) Maybe I'll try to find an introductory book - do you have any recommendations?

OK, I think I didn't explain properly. The algorithm takes a set of numbers and re-orders them. I then look at how the numbers are divided: I know the source of each number, but the algorithm was not given that information - I am testing whether it can recover the groupings on its own. So the algorithm never guesses 'X' or 'O'; it separates numbers based on how they might be related, and I am hoping that any pattern it finds will happen to recover the X/O split, showing that there is something predictable about each group. I know that with real data this will not work perfectly, so I want to quantify whatever results the algorithm produces.

> It isn't possible to rationally discuss the statistics and probabilities involved in a real world problem unless you form a probability model for it - i.e. you form a clear enough picture of the situation so that you could write a computer simulation of it (even if you don't actually write such a simulation).

Yes. And yes, what you say is true.

If the test sets are successful, I will apply the algorithm to similar sets where the source of the numbers is unknown, and suggest which group each number should belong to, using the test sets as a proof of concept. Depending on its success, I may also use whatever measure I come up with to compare alternative algorithms. I have many test sets, and quantifying the level of success the algorithm has with a particular set would itself be useful.

4. Jan 11, 2012

### Stephen Tashi

I suggest you start with a book on simulation or, if you aren't a general-purpose computer programmer, with a software package that does simulations. In my opinion, studying statistics has a lobotomizing effect on many people. A specialist in genetics or finance studies statistics for a few weeks and begins to approach every problem in their field as a problem in statistics, without applying what they know about genetics or finance. Doing simulations requires you to use your specialized knowledge.

That description is still too vague. I don't know whether you are inclined to be secretive, or whether you can't express technical ideas clearly, or whether you haven't formed a clear picture of the situation in your own mind. Whatever the cause, a person trying to give you specific advice is forced to guess what you are talking about or to cross-examine you for information.

OK, that establishes that the decision to be made is whether the algorithm works, but it isn't specific enough to rank the possible behaviors of the algorithm - which is essentially what your original question asks. You have to describe specifically how you use the output of the algorithm to forecast which group a thing belongs to. What are the consequences of mis-classifying a thing?