Algorithm creates representative set of data

In summary, the conversation discusses an algorithm to analyze a table of data and select a representative subset of rows based on a specified variance value. The algorithm works by iterating through the rows and calculating the variance for the cardinality column. It then stops iterating when the variance is equal to or less than the specified value, returning the selected rows as the representative set. The conversation also mentions the possibility of using a mathematical tool to improve the algorithm, such as performing a Boolean search to optimize the process.
  • #1
mundek88
2
0
Hi all,
I have algorithm to analyze and make it easier to implement in programming language (Python). We have table with data and we want to select only representative part.

It looks like:
ID_PRODUCT | CARDINALITY | SET VARIANCE WITH THIS ELEMENT AND ABOVE
10 ---------------- 110 --------------- 400
11 ---------------- 90 ---------------- 350
12 ---------------- 80 ---------------- 300
... --------------- ... ---------------- ...

* variance is calculated for cardinality columnAlgorithm works as follows:
Iterate over rows from the top of table and in each loop add new row and count variance for cardinality column. Stop iteratation if variance is equal or less than specified (so, finally we want to produce set of rows with variance bigger than X) and then return created (now representative) set

Question:
This is legacy solution and hard to say for me how we can do it better. Is there any math tool which cut away elements hardly representative? We can not statically based on the cardinality (like: just give rows with cardinality > 50) because the day-to-day can change the order of magnitude.

Thanks in advice!
 
Physics news on Phys.org
  • #2
I'm guessing that English is not your first language.
In any case, I think I understand roughly what you are looking for. So I'll describe what I understand and if I miss your point, let me know.

You have a table of "N" rows and 3 columns. On field (column) is a numeric "variance" and the rows are order by descending variance.
Basically, you are looking for a function that will discover how many of these rows have a variance greater than a specified value. So a declaration of this function might look something like this: int RecordCount(array Table, int RowCount, float LowVariance)

The most obvious optimization here would be to perform a Boolean search.
For example, if the RowCount is 120, the answer will be a value of 0 to 120, so take the average of those numbers (60) and check the variance in that row.
If the variance at row 60 is less than or equal to LowVariance, then you know that your answer will be 0 to 59, so repeat the process with those numbers.
If the variance at row 60 is greater than LowVariance, then your answer will be 60 to 120, so repeat the process with those numbers.
Finally, you will have a range that is from and to the same value - which you then return as your answer.
Be careful in the way you round for the average. You want to make sure that you don't start a new iteration with the same range that you used in the previous iteration.
 

1) What is an algorithm for creating a representative set of data?

An algorithm for creating a representative set of data is a set of step-by-step instructions or rules that are followed to generate a sample of data that accurately reflects the larger population or dataset. This can be used in various fields such as statistics, machine learning, and data analysis.

2) How does the algorithm ensure the data is representative?

The algorithm uses statistical methods to select a sample that is proportional and unbiased to the larger population. This means that every data point in the population has an equal chance of being included in the sample, resulting in a representative set of data.

3) What factors should be considered when developing an algorithm for creating a representative set of data?

Some factors to consider when developing an algorithm for creating a representative set of data include the size of the population, the desired level of accuracy, and the type of data being collected. It is also important to consider potential biases and how they may affect the representation of the data.

4) Can an algorithm create a completely accurate representation of a dataset?

No, an algorithm can only create a sample that is representative of the larger dataset. The accuracy of the representation depends on the quality of the algorithm and the factors mentioned above. However, with a properly designed and executed algorithm, the representation can be very close to the actual dataset.

5) How can the results of the algorithm be used in practical applications?

The results of the algorithm can be used in various practical applications such as market research, product development, and decision making. By using a representative set of data, the results can provide insights and predictions that can be applied to the larger population or dataset.

Similar threads

  • General Math
Replies
1
Views
1K
  • Set Theory, Logic, Probability, Statistics
Replies
20
Views
1K
  • Set Theory, Logic, Probability, Statistics
Replies
5
Views
2K
  • Set Theory, Logic, Probability, Statistics
Replies
2
Views
544
  • Programming and Computer Science
Replies
7
Views
388
  • MATLAB, Maple, Mathematica, LaTeX
Replies
8
Views
1K
  • Set Theory, Logic, Probability, Statistics
Replies
2
Views
3K
  • Set Theory, Logic, Probability, Statistics
Replies
11
Views
3K
  • MATLAB, Maple, Mathematica, LaTeX
Replies
2
Views
1K
  • Programming and Computer Science
Replies
9
Views
1K
Back
Top