# A Algorithm creates representative set of data

1. May 23, 2016

### mundek88

Hi all,
I have algorithm to analyze and make it easier to implement in programming language (Python). We have table with data and we want to select only representative part.

It looks like:
ID_PRODUCT | CARDINALITY | SET VARIANCE WITH THIS ELEMENT AND ABOVE
10 ---------------- 110 --------------- 400
11 ---------------- 90 ---------------- 350
12 ---------------- 80 ---------------- 300
... --------------- ... ---------------- ...

* variance is calculated for cardinality column

Algorithm works as follows:
Iterate over rows from the top of table and in each loop add new row and count variance for cardinality column. Stop iteratation if variance is equal or less than specified (so, finally we want to produce set of rows with variance bigger than X) and then return created (now representative) set

Question:
This is legacy solution and hard to say for me how we can do it better. Is there any math tool which cut away elements hardly representative? We can not statically based on the cardinality (like: just give rows with cardinality > 50) because the day-to-day can change the order of magnitude.

2. May 27, 2016

### .Scott

I'm guessing that English is not your first language.
In any case, I think I understand roughly what you are looking for. So I'll describe what I understand and if I miss your point, let me know.

You have a table of "N" rows and 3 columns. On field (column) is a numeric "variance" and the rows are order by descending variance.
Basically, you are looking for a function that will discover how many of these rows have a variance greater than a specified value. So a declaration of this function might look something like this: int RecordCount(array Table, int RowCount, float LowVariance)

The most obvious optimization here would be to perform a Boolean search.
For example, if the RowCount is 120, the answer will be a value of 0 to 120, so take the average of those numbers (60) and check the variance in that row.
If the variance at row 60 is less than or equal to LowVariance, then you know that your answer will be 0 to 59, so repeat the process with those numbers.
If the variance at row 60 is greater than LowVariance, then your answer will be 60 to 120, so repeat the process with those numbers.
Finally, you will have a range that is from and to the same value - which you then return as your answer.
Be careful in the way you round for the average. You want to make sure that you don't start a new iteration with the same range that you used in the previous iteration.