Quantifying Significance In Re-ordering Of a Set

In summary, the original poster wants to set up a test case in which an algorithm separates a set of numbers into two groups without being given the true grouping, and is looking for ways to quantify the algorithm's "errors" - they are considering a hypergeometric probability test or a rank correlation approach. They also mention needing a better understanding of the problem, and that the results may eventually be used to suggest which group future sets of numbers belong to. They clarify that the algorithm is deterministic and that the quality measure will be used to compare it with other algorithms and potentially suggest improvements.
  • #1
fasterthanjoao
Hi folks,


I have what I think is quite a basic question, but I'm looking for options.

So, I have data that consists of a set of numbers (this is not a set theory question) - each number can be ascribed to one of two groups (the source of the number). I know the source of each number, but I want to set up a test case that tries to separate the numbers into their two respective groups, pretending I don't know the true result when I start.

Now, my question isn't about the mechanics of re-ordering data or anything of the sort - what I want to do is somehow characterise the 'errors' that the algorithm I'm working with produces. This is perhaps best described by an example: say I have a set of X's and O's (these are actually numbers, but the X and O represents the source of each number - the two groups). The set is ordered arbitrarily: XOOXOXOOOXXOOXXX. I then re-order the set based on some things I know about the numbers and get, say: XXXXOXXOXXOOOOOO.

Then, the set is split into X's and O's with two errors (the O's that are on the side of the X's). This is a pretty good result - and something that I want to quantify. I am thinking I could just do a hypergeometric probability test, splitting the entire set in half and testing the probability that each half contains as many X's and O's as it does. The problem is that I would also like the 'distance' to be important. As in,

OXXXXXXXXOOOOOOO is a worse result than XXXXXXXOXOOOOOOO, because in the first case the stray O has made its way to the far end of the other group. Maybe some rank correlation approach would capture this?
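To make the hypergeometric idea concrete, this is roughly what I have in mind (a Python sketch using SciPy's hypergeometric distribution; splitting the set exactly in half is my own simplification):

```python
from scipy.stats import hypergeom

# Re-ordered set from the example above, with the true source of each
# number written out as X or O.
ordering = "XXXXOXXOXXOOOOOO"

N = len(ordering)            # population size: 16 numbers in total
K = ordering.count("X")      # 8 X's in the whole set
n = N // 2                   # size of the left half
k = ordering[:n].count("X")  # X's that landed in the left half: 6

# Probability of k or more X's in the left half if the ordering were
# completely random (the survival function at k - 1 gives P(count >= k)).
p_value = hypergeom.sf(k - 1, N, K, n)
print(f"{k} X's in the left half, p = {p_value:.4f}")
```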

I want to take care and avoid doing something silly!


thank you,
N
 
  • #2
I think that you need to state more details of the problem in order to get a good answer. For example, it isn't clear from your description whether the algorithm you are evaluating sometimes gets the total number of X's wrong in addition to making errors about how the X's and O's are ordered.

It isn't possible to rationally discuss the statistics and probabilities involved in a real world problem unless you form a probability model for it - i.e. you form a clear enough picture of the situation so that you could write a computer simulation of it (even if you don't actually write such a simulation).

For example, in the case of a deterministic algorithm, how are probabilities involved in its output? We'd have to assume the probabilistic behavior is due to variations in its inputs - i.e. there are some input sets of numbers that it classifies better than others. Is the algorithm you are evaluating deterministic?

If you quantify the quality of the algorithm, what decisions will be made on the basis of that quantity? Can you assign a numerical cost to making a wrong decision? Will you be using the method to compare two different algorithms? In real world problems, people usually can't answer those sorts of questions completely, but it helps to discuss them if you want good advice on how to do the quantification.
 
  • #3
Great reply, thanks - I see what you mean, and I clearly need a much better understanding. Perhaps I need a statistics degree :) Maybe I will try to find a basic book - do you have any recommendations?

Stephen Tashi said:
I think that you need to state more details of the problem in order to get a good answer. For example, it isn't clear from your description whether the algorithm you are evaluating sometimes gets the total number of X's wrong in addition to making errors about how the X's and O's are ordered.

OK, I think I didn't explain properly. The algorithm takes a set of numbers, and re-orders them. I then look at how the numbers are divided (I know the source of each number, but I did not use this information in the algorithm) - I am testing if the algorithm can recover the groupings without being given that information. Thus, the algorithm does not guess 'X' or 'O' - it separates numbers based on how they might be related - I am hoping that any pattern it finds will happen to recover the X/O information - showing that there is something predictable about each set. I want to distinguish between X/O - but I know when I do this with real data, it will not work perfectly. So I want to quantify any algorithm results.
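Concretely, the error count I want to quantify could be computed something like this (a rough Python sketch; scoring against the best possible cut point, rather than a fixed half-way split, is my own assumption):

```python
def split_errors(ordering):
    # Try every cut position, treating the left side as the X group and
    # the right side as the O group (and the reverse, via n - errors),
    # and return the smallest misclassification count.
    n = len(ordering)
    best = n
    for cut in range(n + 1):
        stray_o = ordering[:cut].count("O")   # O's stranded among the X's
        stray_x = ordering[cut:].count("X")   # X's stranded among the O's
        errors = stray_o + stray_x
        best = min(best, errors, n - errors)  # n - errors: O-then-X order
    return best

print(split_errors("XXXXOXXOXXOOOOOO"))  # 2, matching the example above
```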


Stephen Tashi said:
It isn't possible to rationally discuss the statistics and probabilities involved in a real world problem unless you form a probability model for it - i.e. you form a clear enough picture of the situation so that you could write a computer simulation of it (even if you don't actually write such a simulation).

Stephen Tashi said:
For example, in the case of a deterministic algorithm, how are probabilities involved in its output? We'd have to assume the probabilistic behavior is due to variations in its inputs - i.e. there are some inputs of numbers that is classifies better than others. Is the algorithm you are evaluating deterministic?

Yes. And yes, what you say is true.

Stephen Tashi said:
If you quantify the quality of the algorithm, what decisions will be made on the basis of that quantity? Can you assign a numerical cost to making a wrong decision? Will you be using the method to compare two different algorithms?

If the test sets are successful, then I will apply the algorithm to similar sets where the source of the numbers is unknown, and make suggestions about which group they belong to, using the test sets as proof of concept. Depending on the level of success, I may also use any measure I come up with to test alternative algorithms. I have many test sets and can get useful work done with them, so quantifying the level of success the algorithm has with a particular set is interesting in itself.
 
  • #4
fasterthanjoao said:
Perhaps I need a statistics degree :) Maybe I will try to find a basic book. Do you have thoughts?

I suggest you start with a book on simulation or, if you aren't a general-purpose computer programmer, with a software package that does simulations. In my opinion, studying statistics has a lobotomizing effect on many people. A specialist in genetics or finance studies statistics for a few weeks and begins to approach every problem in their field as a problem in statistics, without applying what they know about genetics or finance. Doing simulations requires you to use your specialized knowledge.
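To give a flavour of what I mean, a simulation of your whole setup can be a handful of lines (a sketch in Python; the two normal distributions standing in for your sources, and sorting by value as a stand-in "algorithm", are purely illustrative):

```python
import random

random.seed(1)

# Stand-ins for the two sources - replace these with whatever you
# actually know about how each group's numbers are generated.
data = [("X", random.gauss(0.0, 1.0)) for _ in range(8)] \
     + [("O", random.gauss(1.5, 1.0)) for _ in range(8)]
random.shuffle(data)

# A trivial stand-in for the re-ordering algorithm: sort by value alone.
reordered = sorted(data, key=lambda pair: pair[1])
print("".join(label for label, _ in reordered))  # a sequence you can score
```

Running this many times with different seeds shows how the quality of the separation varies with the inputs, which is exactly the probabilistic behavior I mentioned above.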

fasterthanjoao said:
OK, I think I didn't explain properly. The algorithm takes a set of numbers, and re-orders them. I then look at how the numbers are divided (I know the source of each number, but I did not use this information in the algorithm) - I am testing if the algorithm can recover the groupings without being given that information. Thus, the algorithm does not guess 'X' or 'O' - it separates numbers based on how they might be related - I am hoping that any pattern it finds will happen to recover the X/O information - showing that there is something predictable about each set. I want to distinguish between X/O - but I know when I do this with real data, it will not work perfectly. So I want to quantify any algorithm results.

That description is still too vague. I don't know whether you are inclined to be secretive or whether you can't express technical ideas clearly or whether you haven't formed a clear picture of the situation in your own mind. Whatever the cause, a person trying to give you specific advice is forced to guess what you are talking about or cross examine you to get information.


fasterthanjoao said:
If the test sets are successful, then I will apply the algorithm to similar sets where the source of the numbers is unknown, and make suggestions about which group they belong to, using the test sets as proof of concept. Depending on the level of success, I may also use any measure I come up with to test alternative algorithms. I have many test sets and can get useful work done with them, so quantifying the level of success the algorithm has with a particular set is interesting in itself.

OK, that establishes that the decision to be made is whether the algorithm works, but it isn't specific enough to rank the possible behaviors of the algorithm - which is essentially what your original question asks for. You have to describe specifically how you use the output of the algorithm to forecast which group a thing belongs to. What are the consequences of mis-classifying a thing?
 
  • #5
Hi Nate, thank you for your question. It seems you are trying to quantify the significance of re-ordering a set of numbers based on certain characteristics. This is a common task in data analysis, and it can be approached with statistical tests such as the hypergeometric probability test or a rank correlation.

To determine the significance of your re-ordering, you could compare the observed results to a random or null distribution. This would involve randomly shuffling the numbers in your set and re-ordering them, and then comparing the resulting split into X's and O's to your observed split. This would give you a baseline for what would be expected if the re-ordering had no effect.

Another option could be to use a permutation test, where you randomly shuffle the numbers multiple times and calculate the proportion of shuffles that result in a split with the same or fewer errors as your observed re-ordering. This would give you a p-value, which represents the probability of obtaining the observed result or a more extreme result if the re-ordering had no effect.
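A minimal sketch of such a permutation test (plain Python; counting the O's in the left half is just a placeholder for whatever error measure you settle on):

```python
import random

def errors(ordering, n_left=8):
    # Placeholder error measure: O's that land in the left (X) half.
    return ordering[:n_left].count("O")

observed = list("XXXXOXXOXXOOOOOO")
obs_err = errors(observed)

random.seed(0)
trials = 10_000
as_good = sum(
    errors(random.sample(observed, len(observed))) <= obs_err
    for _ in range(trials)
)
p_value = as_good / trials  # fraction of shuffles that do at least as well
print(f"observed errors = {obs_err}, permutation p-value = {p_value:.4f}")
```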

In terms of incorporating the 'distance' between the X's and O's, you could consider using a metric such as Kendall's rank correlation coefficient, which takes the positions of the items into account. This would let you quantify the agreement between the re-ordered set and an ideal, fully separated ordering, with misplaced items penalised more heavily the further they sit from the boundary.
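For example, with SciPy (a sketch; encoding X as 0 and O as 1, so that a perfect X-then-O split correlates exactly with position, is one way you might set this up):

```python
from scipy.stats import kendalltau

def tau_score(ordering):
    # Correlate each label's position with its code (X -> 0, O -> 1).
    # Tau is highest for a clean X|O split, and every X that a stray O
    # jumps over adds a discordant pair, so distance from the boundary
    # costs more.
    codes = [0 if c == "X" else 1 for c in ordering]
    tau, p = kendalltau(range(len(codes)), codes)
    return tau, p

print(tau_score("XXXXXXXOXOOOOOOO"))  # stray O just inside the boundary
print(tau_score("OXXXXXXXXOOOOOOO"))  # stray O at the far end: lower tau
```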

Overall, it is important to carefully consider the assumptions and limitations of any statistical test you use and to interpret the results in the context of your specific data and research question. I hope this helps and good luck with your analysis!
 

FAQ: Quantifying Significance In Re-ordering Of a Set

1. What is the purpose of quantifying significance in re-ordering of a set?

The purpose of quantifying significance in re-ordering of a set is to determine the extent to which the new ordering differs from the original ordering. This can help identify any patterns, trends, or significant changes in the data that may have occurred.

2. How is significance quantified in re-ordering of a set?

Significance in re-ordering of a set is typically quantified using statistical tests - for ordering problems, permutation tests and rank correlation tests are natural choices - which produce a p-value. This p-value represents the probability of obtaining the observed result, or a more extreme one, by chance alone, and a smaller p-value indicates a more significant difference between the two orderings.

3. What factors should be considered when quantifying significance in re-ordering of a set?

When quantifying significance in re-ordering of a set, it is important to consider the sample size, the type of data being analyzed, the chosen statistical test, and the desired level of significance. These factors can influence the accuracy and reliability of the results.

4. Can significance be determined without statistical tests?

In some cases, significance can be determined without statistical tests by visually inspecting the data and identifying any noticeable differences between the two orderings. However, this approach may not be as accurate or reliable as using statistical tests.

5. How can the results of quantifying significance in re-ordering of a set be interpreted?

The results of quantifying significance in re-ordering of a set can be interpreted by comparing the calculated p-value to a predetermined significance level, usually 0.05. If the p-value is smaller than the significance level, the difference between the two orderings is considered significant. The direction and magnitude of the difference can also be taken into account when interpreting the results.
