Algorithm creates representative set of data

mundek88 · May 23, 2016

Hi all,
I have algorithm to analyze and make it easier to implement in programming language (Python). We have table with data and we want to select only representative part.

It looks like:
ID_PRODUCT | CARDINALITY | SET VARIANCE WITH THIS ELEMENT AND ABOVE
10 ---------------- 110 --------------- 400
11 ---------------- 90 ---------------- 350
12 ---------------- 80 ---------------- 300
... --------------- ... ---------------- ...

* variance is calculated for cardinality columnAlgorithm works as follows:
Iterate over rows from the top of table and in each loop add new row and count variance for cardinality column. Stop iteratation if variance is equal or less than specified (so, finally we want to produce set of rows with variance bigger than X) and then return created (now representative) set

Question:
This is legacy solution and hard to say for me how we can do it better. Is there any math tool which cut away elements hardly representative? We can not statically based on the cardinality (like: just give rows with cardinality > 50) because the day-to-day can change the order of magnitude.

Thanks in advice!

.Scott · May 27, 2016

I'm guessing that English is not your first language.
In any case, I think I understand roughly what you are looking for. So I'll describe what I understand and if I miss your point, let me know.

You have a table of "N" rows and 3 columns. On field (column) is a numeric "variance" and the rows are order by descending variance.
Basically, you are looking for a function that will discover how many of these rows have a variance greater than a specified value. So a declaration of this function might look something like this: int RecordCount(array Table, int RowCount, float LowVariance)

The most obvious optimization here would be to perform a Boolean search.
For example, if the RowCount is 120, the answer will be a value of 0 to 120, so take the average of those numbers (60) and check the variance in that row.
If the variance at row 60 is less than or equal to LowVariance, then you know that your answer will be 0 to 59, so repeat the process with those numbers.
If the variance at row 60 is greater than LowVariance, then your answer will be 60 to 120, so repeat the process with those numbers.
Finally, you will have a range that is from and to the same value - which you then return as your answer.
Be careful in the way you round for the average. You want to make sure that you don't start a new iteration with the same range that you used in the previous iteration.

Algorithm creates representative set of data

Similar threads

Graduate Expected numbers of cards of a last color remaining

Undergrad The problem of points

Graduate Probability puzzle

Undergrad How does axiom of foundation prevent infinite sequence of elements?

Undergrad The countability paradox of computable numbers

Insights Revisiting the Velocity-Time Function

Insights Remote Operated Gate Control System

Insights AI Enriched Problem Solving

Insights Thinking Outside The Box Versus Knowing What’s In The Box

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect