Quantifying Significance In Re-ordering Of a Set

fasterthanjoao · Jan 10, 2012

Hi folks,

I have what I think is quite a basic question, but I'm looking for options.

So, I have data that consists of a set of numbers (this is not a set theory question) - each number can be ascribed to one of two groups (the source of the number). Now, I have knowledge of the source of each number - but I want to set up a test case, to try and group the numbers such that they separate into their two respective groups (pretending I don't know what the true result is when I start).

Now, my question isn't about the mechanics of re-ordering data or anything of the sort - what I want to do is somehow characterise the 'errors' that the algorithm I'm working with produces. This is perhaps best described by an example: say I have a set of X's and O's (these are actually numbers, but the X and O represent the source of each number - the two groups). The set is ordered arbitrarily: XOOXOXOOOXXOOXXX. I then re-order the set based on some things I know about the numbers and get, say: XXXXOXXOXXOOOOOO.

Then, the set is split into X's and O's with two errors (the O's that are on the side of the X's). This is a pretty good result - and something that I want to quantify. I am thinking I could just do a hypergeometric probability test, splitting the entire set in half and testing the probability that each half contains as many X's and O's as it does. The problem is that I would also like the 'distance' to be important. As in,

OXXXXXXXXOOOOOOO is a worse result than XXXXXXXOXOOOOOOOO, because the O on the left of the first sequence has made its way to the far end of the other group. Maybe some rank correlation approach would do this?
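One way to make 'distance' count, sketched here purely as an illustration and not as something proposed in the thread: for every O, count how many X's come after it. A stray O sitting at the boundary then costs 1, while an O that has travelled to the far end costs one unit for each X it has passed. With no ties, this count is the Mann-Whitney U statistic for the two groups. The function name `misplacement_score` and the choice of X as the 'left' group are assumptions made for the example.

```python
def misplacement_score(sequence, left="X", right="O"):
    """Count (right, left) pairs that appear in the wrong order.

    Each `left` symbol is charged one unit for every `right` symbol that
    precedes it, so the total is 0 for a perfectly separated sequence and
    n_left * n_right for a completely reversed one.
    """
    score = 0
    rights_seen = 0
    for symbol in sequence:
        if symbol == right:
            rights_seen += 1
        elif symbol == left:
            # every `right` seen so far is on the wrong side of this `left`
            score += rights_seen
        else:
            raise ValueError(f"unexpected symbol: {symbol!r}")
    return score


print(misplacement_score("OXXXXXXXXOOOOOOO"))   # -> 8
print(misplacement_score("XXXXXXXOXOOOOOOOO"))  # -> 1
print(misplacement_score("XXXXOXXOXXOOOOOO"))   # -> 6
```

Dividing the score by `n_left * n_right` gives a value between 0 and 1, which may make sequences of different lengths easier to compare.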

I want to take care and avoid doing something silly!

thank you,
N

Stephen Tashi · Jan 10, 2012

I think that you need to state more details of the problem in order to get a good answer. For example, it isn't clear from your description whether the algorithm you are evaluating sometimes gets the total number of X's wrong in addition to making errors about how the X's and O's are ordered.

It isn't possible to rationally discuss the statistics and probabilities involved in a real world problem unless you form a probability model for it - i.e. you form a clear enough picture of the situation so that you could write a computer simulation of it (even if you don't actually write such a simulation).

For example, in the case of a deterministic algorithm, how are probabilities involved in its output? We'd have to assume the probabilistic behavior is due to variations in its inputs - i.e. there are some inputs of numbers that it classifies better than others. Is the algorithm you are evaluating deterministic?
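To make the idea of "a deterministic algorithm fed with random inputs" concrete, here is a minimal simulation sketch. Everything in it is a hypothetical stand-in, since the thread does not describe the real data or algorithm: the two groups are drawn from normal distributions whose means differ by `shift`, and the "algorithm" is simply sorting by value. The error count is the number of O's that land in the first `n_per_group` positions.

```python
import random


def simulate_error_counts(n_per_group=8, shift=1.0, n_trials=2000, seed=0):
    """Distribution of error counts for a toy model (all choices hypothetical).

    Inputs vary randomly from trial to trial; the re-ordering step itself
    (a plain sort) is deterministic.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_trials):
        # hypothetical input model: X ~ N(0, 1), O ~ N(shift, 1)
        data = [(rng.gauss(0.0, 1.0), "X") for _ in range(n_per_group)]
        data += [(rng.gauss(shift, 1.0), "O") for _ in range(n_per_group)]
        # stand-in for the deterministic algorithm: order by value
        ordered = [label for _, label in sorted(data)]
        # an error is an O that ends up on the X side of the split
        counts.append(ordered[:n_per_group].count("O"))
    return counts


well_separated = simulate_error_counts(shift=3.0)
overlapping = simulate_error_counts(shift=0.0)
print(sum(well_separated) / len(well_separated))
print(sum(overlapping) / len(overlapping))
```

Replacing the input model and the sort with the real data source and the real algorithm is where the specialised knowledge comes in.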

If you quantify the quality of the algorithm, what decisions will be made on the basis of that quantity? Can you assign a numerical cost to making a wrong decision? Will you be using the method to compare two different algorithms? In real world problems, people usually can't answer those sorts of questions completely, but it helps to discuss them if you want good advice on how to do the quantification.

fasterthanjoao · Jan 11, 2012

Great reply, thanks - I see what you mean, and that I need much more understanding. Perhaps I need a statistics degree :) Maybe I will try to find a basic book. Do you have thoughts?

Stephen Tashi said:

I think that you need to state more details of the problem in order to get a good answer. For example, it isn't clear from your description whether the algorithm you are evaluating sometimes gets the total number of X's wrong in addition to making errors about how the X's and O's are ordered.

OK, I think I didn't explain properly. The algorithm takes a set of numbers, and re-orders them. I then look at how the numbers are divided (I know the source of each number, but I did not use this information in the algorithm) - I am testing if the algorithm can recover the groupings without being given that information. Thus, the algorithm does not guess 'X' or 'O' - it separates numbers based on how they might be related - I am hoping that any pattern it finds will happen to recover the X/O information - showing that there is something predictable about each set. I want to distinguish between X/O - but I know when I do this with real data, it will not work perfectly. So I want to quantify any algorithm results.

Stephen Tashi said:

It isn't possible to rationally discuss the statistics and probabilities involved in a real world problem unless you form a probability model for it - i.e. you form a clear enough picture of the situation so that you could write a computer simulation of it (even if you don't actually write such a simulation).

For example, in the case of a deterministic algorithm, how are probabilities involved in its output? We'd have to assume the probabilistic behavior is due to variations in its inputs - i.e. there are some inputs of numbers that it classifies better than others. Is the algorithm you are evaluating deterministic?

Yes. And yes, what you say is true.

Stephen Tashi said:

If you quantify the quality of the algorithm, what decisions will be made on the basis of that quantity? Can you assign a numerical cost to making a wrong decision? Will you be using the method to compare two different algorithms?

If the test sets are successful, then I will apply this algorithm to similar sets, where the source of the numbers is unknown - and make suggestions about which group they should belong to, using the test sets as proof of concept. Depending on success or not, I may also use any measure I come up with to test alternative algorithms. I have many test sets. I can also get some work done with these test sets - and so quantifying the level of success the algorithm has with a particular set is interesting.

Stephen Tashi · Jan 11, 2012

fasterthanjoao said:

Perhaps I need a statistics degree :) Maybe I will try to find a basic book. Do you have thoughts?

I suggest you start with a book on simulation or, if you aren't a general-purpose computer programmer, with a software package that does simulations. In my opinion, studying statistics has a lobotomizing effect on many people. A specialist in genetics or finance studies statistics for a few weeks and then begins to approach every problem in their field as a problem in statistics, without applying what they know about genetics or finance. Doing simulations requires you to use your specialized knowledge.

fasterthanjoao said:

OK, I think I didn't explain properly. The algorithm takes a set of numbers, and re-orders them. I then look at how the numbers are divided (I know the source of each number, but I did not use this information in the algorithm) - I am testing if the algorithm can recover the groupings without being given that information. Thus, the algorithm does not guess 'X' or 'O' - it separates numbers based on how they might be related - I am hoping that any pattern it finds will happen to recover the X/O information - showing that there is something predictable about each set. I want to distinguish between X/O - but I know when I do this with real data, it will not work perfectly. So I want to quantify any algorithm results.

That description is still too vague. I don't know whether you are inclined to be secretive or whether you can't express technical ideas clearly or whether you haven't formed a clear picture of the situation in your own mind. Whatever the cause, a person trying to give you specific advice is forced to guess what you are talking about or cross examine you to get information.

fasterthanjoao said:

If the test sets are successful, then I will apply this algorithm to similar sets, where the source of the numbers is unknown - and make suggestions about which group they should belong to, using the test sets as proof of concept. Depending on success or not, I may also use any measure I come up with to test alternative algorithms. I have many test sets. I can also get some work done with these test sets - and so quantifying the level of success the algorithm has with a particular set is interesting.

OK, that establishes that the decision to be made is whether the algorithm works, but it isn't specific enough to rank the possible behaviors of the algorithm - which is essentially what your original question asks. You have to describe specifically how you use the output of the algorithm to forecast what group a thing belongs to. What are the consequences of mis-classifying a thing?

blue_raver22 · Jan 18, 2012

Hi Nate, thank you for your question. It seems like you are trying to quantify the significance of re-ordering a set of numbers based on certain characteristics. This is a common approach in data analysis and can be achieved through statistical tests such as the hypergeometric probability test or a rank correlation approach.
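As a rough sketch of the hypergeometric idea from the original post: split the re-ordered sequence in half and ask how likely it is that the left half would hold at least as many X's as observed if the symbols had been placed at random. The function name `hypergeom_upper_tail` is an assumption for this example, and it needs Python 3.8 or later for `math.comb`.

```python
from math import comb


def hypergeom_upper_tail(sequence, symbol="X"):
    """Return (observed count, P(count >= observed)) for the left half.

    Under random placement, the number of `symbol` items in the left half
    follows a hypergeometric distribution.
    """
    n_total = len(sequence)
    n_symbol = sequence.count(symbol)
    half = n_total // 2
    observed = sequence[:half].count(symbol)

    # sum the probability mass from the observed count up to the maximum
    upper = min(half, n_symbol)
    favourable = sum(
        comb(n_symbol, k) * comb(n_total - n_symbol, half - k)
        for k in range(observed, upper + 1)
    )
    return observed, favourable / comb(n_total, half)


observed, p = hypergeom_upper_tail("XXXXOXXOXXOOOOOO")
print(observed, round(p, 4))  # -> 6 0.066
```

This only looks at the counts in each half, so it cannot tell a stray O at the boundary from one at the far end, which is the limitation raised in the original post.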

To determine the significance of your re-ordering, you could compare the observed results to a random or null distribution. This would involve randomly shuffling the numbers in your set and re-ordering them, and then comparing the resulting split into X's and O's to your observed split. This would give you a baseline for what would be expected if the re-ordering had no effect.

Another option could be to use a permutation test, where you randomly shuffle the numbers many times and calculate the proportion of shuffles that result in a split with as few errors as, or fewer than, your observed re-ordering. This would give you a p-value, which represents the probability of obtaining the observed result or a more extreme result if the re-ordering had no effect.
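A minimal sketch of such a permutation test, under some assumptions of mine: an "error" is taken to be any non-X symbol among the first n_X positions (which gives 2 for the example in the original post), and the names `count_errors` and `permutation_p_value` are made up for illustration. Any other statistic, including a distance-sensitive one, could be passed in as `statistic`.

```python
import random


def count_errors(sequence, left="X"):
    """Number of non-`left` symbols among the first n_left positions."""
    n_left = sequence.count(left)
    return sum(1 for symbol in sequence[:n_left] if symbol != left)


def permutation_p_value(sequence, statistic=count_errors,
                        n_shuffles=10_000, seed=0):
    """Estimate P(statistic <= observed) under random re-ordering."""
    rng = random.Random(seed)
    observed = statistic(sequence)
    labels = list(sequence)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(labels)  # random order = "re-ordering has no effect"
        if statistic(labels) <= observed:
            hits += 1
    # add-one correction so the estimate is never exactly zero
    return observed, (hits + 1) / (n_shuffles + 1)


observed, p = permutation_p_value("XXXXOXXOXXOOOOOO")
print(observed, p)
```

With the error count as the statistic, the estimate should land close to the exact hypergeometric tail probability for the same split; the shuffling approach becomes more useful when the statistic has no convenient closed-form distribution.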

In terms of incorporating the 'distance' between the X's and O's, you could consider using a metric such as Kendall's rank correlation coefficient, which takes into account the order of the numbers in the set. This would allow you to quantify the degree of agreement between the observed and re-ordered sets.
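One possible reading of this suggestion, offered as an assumption and not as the only way to apply it: correlate each item's position in the re-ordered sequence with its group label (X coded as the lower value, O as the higher). Because the labels are heavily tied, the sketch below uses the tau-b variant, computed directly from concordant and discordant pairs. The function name `kendall_tau_b` is chosen for this example.

```python
from math import sqrt


def kendall_tau_b(sequence, first="X", second="O"):
    """Kendall's tau-b between position and a two-valued group label."""
    n = len(sequence)
    n_first = sequence.count(first)
    n_second = sequence.count(second)
    if n_first + n_second != n:
        raise ValueError("sequence may contain only the two group symbols")
    if n_first == 0 or n_second == 0:
        raise ValueError("both groups must be present")

    # discordant pairs: a `second` symbol that precedes a `first` symbol
    discordant = 0
    seconds_seen = 0
    for symbol in sequence:
        if symbol == second:
            seconds_seen += 1
        else:
            discordant += seconds_seen
    # every mixed pair is either concordant or discordant
    concordant = n_first * n_second - discordant

    # positions have no ties; same-label pairs are ties in the label
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / sqrt(total_pairs * n_first * n_second)


print(round(kendall_tau_b("XXXXOXXOXXOOOOOO"), 4))  # -> 0.5934
print(round(kendall_tau_b("XXXXXXXXOOOOOOOO"), 4))  # -> 0.7303
```

Because the labels take only two values, tau-b cannot reach 1 here even for a perfect separation, so it helps to compare against the value for a perfectly sorted sequence of the same composition.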

Overall, it is important to carefully consider the assumptions and limitations of any statistical test you use and to interpret the results in the context of your specific data and research question. I hope this helps and good luck with your analysis!
