Quantifying Significance in Re-ordering of a Set

  • Context: Graduate
  • Thread starter: fasterthanjoao
  • Tags: Set, Significance
SUMMARY

The discussion centers on quantifying the significance of errors produced by an algorithm that re-orders a set of numbers drawn from two groups, represented as X's and O's. The user knows the true group of each number but withholds that information from the algorithm, and wants to evaluate how well the re-ordering recovers the grouping, characterizing the errors both by how many there are and by how far each misplaced element sits from its correct group. Suggestions include a hypergeometric probability test and rank correlation methods. The conversation emphasizes that a clear probability model must be formed before the algorithm's output and its implications can be discussed rationally.

PREREQUISITES
  • Understanding of hypergeometric probability tests
  • Familiarity with rank correlation methods
  • Basic knowledge of algorithm evaluation techniques
  • Concept of probability modeling in statistical analysis
NEXT STEPS
  • Research hypergeometric probability tests for error quantification
  • Explore rank correlation methods for assessing algorithm performance
  • Study probability modeling techniques for algorithm evaluation
  • Investigate simulation software for practical application of statistical concepts
USEFUL FOR

Data scientists, algorithm developers, statisticians, and anyone involved in evaluating the performance of classification algorithms.

fasterthanjoao
Hi folks,


I have what I think is quite a basic question, but I'm looking for options.

So, I have data that consists of a set of numbers (this is not a set theory question) - each number can be ascribed to one of two groups (its source). I know the source of each number, but I want to set up a test case: pretending I don't know the true result when I start, I try to group the numbers so that they separate into their two respective groups.

Now, my question isn't about the mechanics of re-ordering data or anything of the sort - what I want to do is somehow characterise the 'errors' that the algorithm I'm working with produces. This is perhaps best described by an example: say I have a set of X's and O's (these are actually numbers, but the X and O represents the source of each number - the two groups). The set is ordered arbitrarily: XOOXOXOOOXXOOXXX. I then re-order the set based on some things I know about the numbers and get, say: XXXXOXXOXXOOOOOO.

Then, the set is split into X's and O's with two errors (the O's that are on the side of the X's). This is a pretty good result - and something that I want to quantify. I am thinking I could just do a hypergeometric probability test, splitting the entire set in half and testing the probability that each half contains as many X's and O's as it does. The problem is that I would also like the 'distance' to be important. As in,

OXXXXXXXXOOOOOOO is a worse result than XXXXXXXOXOOOOOOO, because in the first case the stray O has made its way to the far end of the other group. Maybe some rank correlation approach would capture this?
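For what it's worth, both ideas can be sketched in a few lines of plain Python. This is only a sketch: the function names (`hypergeom_pvalue`, `rank_sum`) are my own, and it assumes the ordering is given as a string of X's and O's. The rank-sum of the O positions is essentially the statistic behind the Mann-Whitney U test, and it is distance-sensitive in exactly the desired way, since an O pushed deep into the X territory lowers the sum more than one sitting just across the boundary:

```python
from math import comb

def hypergeom_pvalue(ordering, half=None):
    """Hypergeometric tail probability: chance that a random placement
    of the same symbols puts at least this many X's in the left half."""
    n = len(ordering)
    half = half or n // 2
    K = ordering.count("X")          # total X's in the set
    k = ordering[:half].count("X")   # X's that landed in the left half
    # Sum P(exactly i X's in the left half) for i = k .. max possible.
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible terms drop out automatically.
    tail = sum(comb(K, i) * comb(n - K, half - i)
               for i in range(k, min(K, half) + 1))
    return tail / comb(n, half)

def rank_sum(ordering):
    """Sum of the positions occupied by O's. Higher means the O's sit
    further right (better separation); unlike a simple error count,
    a deeply misplaced O drags the sum down more than a borderline one."""
    return sum(i for i, c in enumerate(ordering) if c == "O")
```

For example, `rank_sum("OXXXXXXXXOOOOOOO")` comes out lower (84) than `rank_sum("XXXXXXXOXOOOOOOO")` (91), matching the intuition that the deeply misplaced O is the worse outcome.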

I want to take care and avoid doing something silly!


thank you,
N
 
I think that you need to state more details of the problem in order to get a good answer. For example, it isn't clear from your description whether the algorithm you are evaluating sometimes gets the total number of X's wrong in addition to making errors about how the X's and O's are ordered.

It isn't possible to rationally discuss the statistics and probabilities involved in a real world problem unless you form a probability model for it - i.e. you form a clear enough picture of the situation so that you could write a computer simulation of it (even if you don't actually write such a simulation).

For example, in the case of a deterministic algorithm, how are probabilities involved in its output? We'd have to assume the probabilistic behavior is due to variations in its inputs - i.e. there are some input sets of numbers that it classifies better than others. Is the algorithm you are evaluating deterministic?

If you quantify the quality of the algorithm, what decisions will be made on the basis of that quantity? Can you assign a numerical cost to making a wrong decision? Will you be using the method to compare two different algorithms? In real world problems, people usually can't answer those sorts of questions completely, but it helps to discuss them if you want good advice on how to do the quantification.
 
Great reply, thanks - I see what you mean, and that I need a much better understanding. Perhaps I need a statistics degree :) Maybe I will try to find a basic book. Do you have thoughts?

Stephen Tashi said:
I think that you need to state more details of the problem in order to get a good answer. For example, it isn't clear from your description whether the algorithm you are evaluating sometimes gets the total number of X's wrong in addition to making errors about how the X's and O's are ordered.

OK, I think I didn't explain properly. The algorithm takes a set of numbers, and re-orders them. I then look at how the numbers are divided (I know the source of each number, but I did not use this information in the algorithm) - I am testing if the algorithm can recover the groupings without being given that information. Thus, the algorithm does not guess 'X' or 'O' - it separates numbers based on how they might be related - I am hoping that any pattern it finds will happen to recover the X/O information - showing that there is something predictable about each set. I want to distinguish between X/O - but I know when I do this with real data, it will not work perfectly. So I want to quantify any algorithm results.


Stephen Tashi said:
It isn't possible to rationally discuss the statistics and probabilities involved in a real world problem unless you form a probability model for it - i.e. you form a clear enough picture of the situation so that you could write a computer simulation of it (even if you don't actually write such a simulation).

Stephen Tashi said:
For example, in the case of a deterministic algorithm, how are probabilities involved in its output? We'd have to assume the probabilistic behavior is due to variations in its inputs - i.e. there are some input sets of numbers that it classifies better than others. Is the algorithm you are evaluating deterministic?

Yes. And yes, what you say is true.

Stephen Tashi said:
If you quantify the quality of the algorithm, what decisions will be made on the basis of that quantity? Can you assign a numerical cost to making a wrong decision? Will you be using the method to compare two different algorithms?

If the test sets are successful, then I will apply this algorithm to similar sets where the source of the numbers is unknown, and make suggestions about which group each number should belong to, using the test sets as proof of concept. Depending on success, I may also use any measure I come up with to test alternative algorithms. I have many test sets, and I can also get some work done with them - so quantifying the level of success the algorithm has with a particular set is interesting.
 
fasterthanjoao said:
Perhaps I need a statistics degree :) Maybe I will try to find a basic book. Do you have thoughts?

I suggest you start with a book on simulation or if you aren't a general purpose computer programmer then start with a software package that does simulations. In my opinion, studying statistics has a lobotomizing effect on many people. A specialist in genetics or finance studies statistics for a few weeks and they begin to approach every problem in their field as a problem in statistics without applying what they know about genetics or finance. Doing simulations requires you to use your specialized knowledge.
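In that spirit, a null distribution for almost any separation score can be obtained by simulation with nothing more than the standard library: shuffle the same multiset of labels many times and ask how often chance alone does at least as well as the algorithm did. This is only a sketch under my own assumptions - the function name is illustrative, and the rank-sum of the O positions is just one possible choice of score:

```python
import random

def null_distribution(ordering, trials=10000, seed=0):
    """Empirical p-value: fraction of random shuffles of the same
    symbols that score at least as well as the observed ordering.
    The score is the rank-sum of the O positions (higher = O's
    further right = better separation)."""
    rng = random.Random(seed)

    def score(s):
        return sum(i for i, c in enumerate(s) if c == "O")

    observed = score(ordering)
    symbols = list(ordering)
    hits = 0
    for _ in range(trials):
        rng.shuffle(symbols)
        if score(symbols) >= observed:
            hits += 1
    return hits / trials
```

A small p-value then says the observed separation would be very unlikely if the algorithm were effectively shuffling at random - and no parametric model has to be assumed, only the shuffling mechanism itself.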

fasterthanjoao said:
OK, I think I didn't explain properly. The algorithm takes a set of numbers, and re-orders them. I then look at how the numbers are divided (I know the source of each number, but I did not use this information in the algorithm) - I am testing if the algorithm can recover the groupings without being given that information. Thus, the algorithm does not guess 'X' or 'O' - it separates numbers based on how they might be related - I am hoping that any pattern it finds will happen to recover the X/O information - showing that there is something predictable about each set. I want to distinguish between X/O - but I know when I do this with real data, it will not work perfectly. So I want to quantify any algorithm results.

That description is still too vague. I don't know whether you are inclined to be secretive, or whether you can't express technical ideas clearly, or whether you haven't formed a clear picture of the situation in your own mind. Whatever the cause, a person trying to give you specific advice is forced to guess what you are talking about or cross-examine you to get information.


fasterthanjoao said:
If the test sets are successful, then I will apply this algorithm to similar sets, where the source of the numbers are unknown - and make suggestion about which group they should belong to, using the test sets as proof of concept. Depending on success or not, I may also use any measure I come up with to test alternative algorithms. I have many test sets. I can also get some work done with these test-sets - and so quantifying the level of success the algorithm has with a particular set is interesting.

OK, that establishes that the decision to be made is whether the algorithm works, but it isn't specific enough to rank the possible behaviors of the algorithm - which is essentially what your original question asks. You have to describe specifically how you use the output of the algorithm to forecast which group a thing belongs to. What are the consequences of mis-classifying a thing?
 
