Quantifying Significance in Re-ordering of a Set

  • Context: Graduate
  • Thread starter: fasterthanjoao
  • Tags: Set, Significance
Discussion Overview

The discussion revolves around quantifying the significance of errors in the re-ordering of a set of numbers that belong to two distinct groups. Participants explore methods to evaluate the performance of an algorithm designed to separate these numbers without prior knowledge of their groupings. The focus is on statistical approaches to characterize the errors produced by the algorithm and the implications of these errors.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant describes a scenario where a set of numbers is re-ordered and seeks to quantify the errors in grouping them into two categories (X's and O's).
  • Another participant emphasizes the need for a clear probability model to discuss the statistics and probabilities involved, questioning whether the algorithm might miscount the total number of X's.
  • There is a suggestion that understanding the probabilistic behavior of the algorithm's output is crucial, especially if the algorithm is deterministic.
  • One participant expresses uncertainty about how to effectively quantify the algorithm's quality and the potential decisions based on that quantification.
  • Another participant proposes starting with simulation techniques or software, cautioning against an over-reliance on statistics without applying specialized knowledge.
  • Concerns are raised about the clarity of the problem description, with calls for more detailed information to provide specific advice.
  • There is a mention of the intention to apply the algorithm to unknown data sets if initial tests are successful, indicating a practical application of the discussed methods.

Areas of Agreement / Disagreement

Participants generally agree on the need for a clearer understanding of the problem and the importance of quantifying algorithm performance. However, there are multiple competing views regarding the best approach to model the problem and the role of statistics versus simulation.

Contextual Notes

Participants note limitations in the problem description, particularly regarding the specifics of the algorithm's behavior and the statistical framework needed for analysis. There is also an acknowledgment of the complexity involved in quantifying errors and the potential variability in algorithm performance based on input data.

fasterthanjoao
Hi folks,


I have what I think is quite a basic question, but I'm looking for options.

So, I have data that consists of a set of numbers (this is not a set theory question) - each number can be ascribed to one of two groups (the source of the number). Now, I do have knowledge of the source of each number - but I want to set up a test case, to try to group the numbers such that they separate into their two respective groups (pretending I don't know what the true result is when I start).

Now, my question isn't about the mechanics of re-ordering data or anything of the sort - what I want to do is somehow characterise the 'errors' that the algorithm I'm working with produces. This is perhaps best described by an example: say I have a set of X's and O's (these are actually numbers, but the X and O represents the source of each number - the two groups). The set is ordered arbitrarily: XOOXOXOOOXXOOXXX. I then re-order the set based on some things I know about the numbers and get, say: XXXXOXXOXXOOOOOO.

Then, the set is split into X's and O's with two errors (the O's that are on the side of the X's). This is a pretty good result - and something that I want to quantify. I am thinking I could just do a hypergeometric probability test, splitting the entire set in half and testing the probability that each half contains as many X's and O's as it does. The problem is that I would also like the 'distance' to be important. As in,

OXXXXXXXXOOOOOOO is a worse result than XXXXXXXOXOOOOOOO, because the stray O on the left has made its way to the far end of the other group. Maybe some rank correlation approach would capture this?
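The hypergeometric idea can be sketched with the standard library (a rough sketch; splitting at the halfway point and using a one-sided tail are my assumptions about how the test would be set up):

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    # P(exactly k X's in a draw of n items from N items of which K are X's)
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def half_split_pvalue(seq):
    # One-sided tail: the chance that a random half of the set contains
    # at least as many X's as the observed left half does.
    N = len(seq)
    n = N // 2                     # split the entire set in half
    K = seq.count('X')
    k = seq[:n].count('X')
    return sum(hypergeom_pmf(j, N, K, n) for j in range(k, min(K, n) + 1))
```

For the example output XXXXOXXOXXOOOOOO (6 X's in the left half of 16), this gives 849/12870, about 0.066; a perfect split XXXXXXXXOOOOOOOO gives 1/12870.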
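A rank-based measure along those lines could be the normalized rank sum of the X positions, which is essentially the Mann-Whitney U statistic scaled to [0, 1], so a displaced symbol costs more the farther it travels. A sketch (the function name and the normalization are mine):

```python
def separation_badness(seq):
    # Normalized excess rank sum of the X's: 0 means all X's come first
    # (perfect separation), 1 means all X's come last. Equal to the
    # Mann-Whitney U statistic divided by n_x * n_o.
    pos = [i for i, s in enumerate(seq) if s == 'X']
    n_x, n_o = len(pos), len(seq) - len(pos)
    best = n_x * (n_x - 1) // 2    # rank sum if the X's filled the first n_x slots
    return (sum(pos) - best) / (n_x * n_o)
```

This ranks OXXXXXXXXOOOOOOO (badness 8/64) as worse than XXXXXXXOXOOOOOOO (badness 1/64), matching the intuition that the distance an O travels should matter.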

I want to take care and avoid doing something silly!


thank you,
N
 
I think that you need to state more details of the problem in order to get a good answer. For example, it isn't clear from your description whether the algorithm you are evaluating sometimes gets the total number of X's wrong in addition to making errors about how the X's and O's are ordered.

It isn't possible to rationally discuss the statistics and probabilities involved in a real world problem unless you form a probability model for it - i.e. you form a clear enough picture of the situation so that you could write a computer simulation of it (even if you don't actually write such a simulation).

For example, in the case of a deterministic algorithm, how are probabilities involved in its output? We'd have to assume the probabilistic behavior is due to variations in its inputs - i.e. there are some inputs of numbers that it classifies better than others. Is the algorithm you are evaluating deterministic?

If you quantify the quality of the algorithm, what decisions will be made on the basis of that quantity? Can you assign a numerical cost to making a wrong decision? Will you be using the method to compare two different algorithms? In real world problems, people usually can't answer those sorts of questions completely, but it helps to discuss them if you want good advice on how to do the quantification.
 
Great reply, thanks - I see what you mean, and that I need much more understanding. Perhaps I need a statistics degree :) Maybe I will try to find a basic book. Do you have thoughts?

Stephen Tashi said:
I think that you need to state more details of the problem in order to get a good answer. For example, it isn't clear from your description whether the algorithm you are evaluating sometimes gets the total number of X's wrong in addition to making errors about how the X's and O's are ordered.

OK, I think I didn't explain properly. The algorithm takes a set of numbers, and re-orders them. I then look at how the numbers are divided (I know the source of each number, but I did not use this information in the algorithm) - I am testing if the algorithm can recover the groupings without being given that information. Thus, the algorithm does not guess 'X' or 'O' - it separates numbers based on how they might be related - I am hoping that any pattern it finds will happen to recover the X/O information - showing that there is something predictable about each set. I want to distinguish between X/O - but I know when I do this with real data, it will not work perfectly. So I want to quantify any algorithm results.
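One concrete reading of this setup (my own sketch of the test harness, not the poster's code): turn the algorithm's output order into a label sequence using the known sources, split it at the known number of X's, and count the stranded symbols as errors.

```python
def count_split_errors(reordered_labels):
    # Split the reordered label sequence at the known number of X's and
    # count the O's stranded in the X half (equal to the X's in the O half).
    n_x = reordered_labels.count('X')
    return reordered_labels[:n_x].count('O')
```

For the earlier example XXXXOXXOXXOOOOOO this counts 2 errors - the two O's on the side of the X's.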


Stephen Tashi said:
It isn't possible to rationally discuss the statistics and probabilities involved in a real world problem unless you form a probability model for it - i.e. you form a clear enough picture of the situation so that you could write a computer simulation of it (even if you don't actually write such a simulation).

Stephen Tashi said:
For example, in the case of a deterministic algorithm, how are probabilities involved in its output? We'd have to assume the probabilistic behavior is due to variations in its inputs - i.e. there are some inputs of numbers that is classifies better than others. Is the algorithm you are evaluating deterministic?

Yes. And yes, what you say is true.

Stephen Tashi said:
If you quantify the quality of the algorithm, what decisions will be made on the basis of that quantity? Can you assign a numerical cost to making a wrong decision? Will you be using the method to compare two different algorithms?

If the test sets are successful, then I will apply this algorithm to similar sets, where the source of the numbers is unknown - and make suggestions about which group they should belong to, using the test sets as proof of concept. Depending on the level of success, I may also use any measure I come up with to test alternative algorithms. I have many test sets, and I can get some work done with them - so quantifying the level of success the algorithm has with a particular set is interesting.
 
fasterthanjoao said:
Perhaps I need a statistics degree :) Maybe I will try to find a basic book. Do you have thoughts?

I suggest you start with a book on simulation or if you aren't a general purpose computer programmer then start with a software package that does simulations. In my opinion, studying statistics has a lobotomizing effect on many people. A specialist in genetics or finance studies statistics for a few weeks and they begin to approach every problem in their field as a problem in statistics without applying what they know about genetics or finance. Doing simulations requires you to use your specialized knowledge.
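In that spirit, a minimal simulation for this problem would be a permutation test: shuffle the labels many times and ask how often a random ordering separates at least as well as the algorithm's output. A sketch (the rank-sum score and the trial count are my assumptions):

```python
import random

def rank_sum_x(labels):
    # Total of the positions occupied by X's; low means X's pushed to the front.
    return sum(i for i, s in enumerate(labels) if s == 'X')

def permutation_pvalue(seq, trials=10_000, seed=0):
    # Estimate how often a random shuffle of the same labels separates
    # at least as well (rank sum at most as small) as the observed ordering.
    rng = random.Random(seed)
    observed = rank_sum_x(seq)
    labels = list(seq)
    hits = 0
    for _ in range(trials):
        rng.shuffle(labels)
        if rank_sum_x(labels) <= observed:
            hits += 1
    # Add-one smoothing so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)
```

A perfectly separated sequence like XXXXXXXXOOOOOOOO should get a very small p-value, while an alternating sequence should not.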

fasterthanjoao said:
OK, I think I didn't explain properly. The algorithm takes a set of numbers, and re-orders them. I then look at how the numbers are divided (I know the source of each number, but I did not use this information in the algorithm) - I am testing if the algorithm can recover the groupings without being given that information. Thus, the algorithm does not guess 'X' or 'O' - it separates numbers based on how they might be related - I am hoping that any pattern it finds will happen to recover the X/O information - showing that there is something predictable about each set. I want to distinguish between X/O - but I know when I do this with real data, it will not work perfectly. So I want to quantify any algorithm results.

That description is still too vague. I don't know whether you are inclined to be secretive or whether you can't express technical ideas clearly or whether you haven't formed a clear picture of the situation in your own mind. Whatever the cause, a person trying to give you specific advice is forced to guess what you are talking about or cross examine you to get information.


fasterthanjoao said:
If the test sets are successful, then I will apply this algorithm to similar sets, where the source of the numbers are unknown - and make suggestion about which group they should belong to, using the test sets as proof of concept. Depending on success or not, I may also use any measure I come up with to test alternative algorithms. I have many test sets. I can also get some work done with these test-sets - and so quantifying the level of success the algorithm has with a particular set is interesting.

OK, that establishes that the decision to be made is whether the algorithm works, but it isn't specific enough to rank the possible behaviors of the algorithm - which is essentially what your original question asks. You have to describe specifically how you use the output of the algorithm to forecast which group a thing belongs to. What are the consequences of misclassifying a thing?
 
