Graduate Algorithm creates representative set of data

Summary
The discussion focuses on an algorithm designed to select a representative subset of data from a table, based on the variance of the cardinality column. The algorithm iterates through rows, adding them until the variance drops to a specified threshold, with the goal of producing a set whose variance stays above a defined value. Participants suggest optimizing the process with a binary search, which checks the variance at the midpoint of the current range to narrow down the representative set efficiently. The need for a more robust mathematical tool to filter out less representative elements is also raised, since static thresholds on cardinality are ineffective when the data's order of magnitude changes from day to day. Overall, the conversation emphasizes improving the algorithm's efficiency while preserving its effectiveness at representing the data.
mundek88
Hi all,
I have an algorithm to analyze and make easier to implement in a programming language (Python). We have a table with data and we want to select only a representative part of it.

It looks like:
ID_PRODUCT | CARDINALITY | SET VARIANCE WITH THIS ELEMENT AND ABOVE
10         | 110         | 400
11         | 90          | 350
12         | 80          | 300
...        | ...         | ...

* Variance is calculated over the cardinality column.

The algorithm works as follows: iterate over the rows from the top of the table; in each loop, add a new row and compute the variance of the cardinality column. Stop iterating when the variance is equal to or less than a specified threshold (so we ultimately produce a set of rows whose variance is greater than X), then return the created (now representative) set.
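A minimal sketch of that legacy procedure, assuming the table is a list of `(product_id, cardinality)` tuples in table order; the function name `representative_set` and the reading of the stopping rule (return the last prefix whose variance is still above the threshold) are my assumptions, not from the original description:

```python
from statistics import pvariance

def representative_set(rows, threshold):
    # rows: list of (product_id, cardinality) tuples in table order.
    # Grow a prefix one row at a time; as soon as the population
    # variance of the cardinality column falls to the threshold or
    # below, return the previous prefix, whose variance is still
    # above the threshold (one reading of "variance bigger than X").
    cardinalities = []
    for i, (_, cardinality) in enumerate(rows):
        cardinalities.append(cardinality)
        # Variance needs at least two values to be meaningful.
        if len(cardinalities) >= 2 and pvariance(cardinalities) <= threshold:
            return rows[:i]   # last prefix with variance > threshold
    return rows[:]            # variance never dropped: keep everything
```

This recomputes the variance from scratch on every iteration, which is O(n²) over the whole table; the binary-search idea discussed below avoids that when the cumulative variance column is already stored.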

Question:
This is a legacy solution and it is hard for me to say how we can do it better. Is there a mathematical tool that cuts away barely representative elements? We cannot filter statically on cardinality (e.g., just take rows with cardinality > 50), because from day to day the order of magnitude can change.

Thanks in advance!
 
I'm guessing that English is not your first language.
In any case, I think I understand roughly what you are looking for. So I'll describe what I understand and if I miss your point, let me know.

You have a table of N rows and 3 columns. One field (column) is a numeric "variance", and the rows are ordered by descending variance.
Basically, you are looking for a function that will discover how many of these rows have a variance greater than a specified value. A declaration of this function might look something like this: int RecordCount(array Table, int RowCount, float LowVariance)

The most obvious optimization here would be to perform a binary search.
For example, if RowCount is 120, the answer will be a value from 0 to 120, so take the midpoint of that range (60) and check the variance in that row.
If the variance at row 60 is less than or equal to LowVariance, then you know your answer lies in 0 to 59, so repeat the process on that range.
If the variance at row 60 is greater than LowVariance, then your answer lies in 60 to 120, so repeat the process on that range.
Eventually you will have a range whose endpoints are the same value, which you then return as your answer.
Be careful how you round when taking the midpoint: make sure you never start a new iteration with the same range you used in the previous iteration, or the search will never terminate.
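The steps above can be sketched in Python as follows; the names `record_count` and `low_variance` mirror the hypothetical declaration given earlier, and the input is assumed to be just the cumulative-variance column, sorted descending as in the table:

```python
def record_count(variances, low_variance):
    # variances: cumulative set variances per row, sorted descending.
    # Returns how many leading rows have variance strictly greater
    # than low_variance, by repeatedly halving the candidate range.
    lo, hi = 0, len(variances)      # the answer lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if variances[mid] > low_variance:
            lo = mid + 1            # rows 0..mid all qualify
        else:
            hi = mid                # the answer is at most mid
    return lo
```

The `(lo + hi) // 2` midpoint together with the `lo = mid + 1` step guarantees the range shrinks on every iteration, which is exactly the rounding pitfall warned about above.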
 
