Real-world logic problem with statistics and groups


Discussion Overview

The discussion revolves around a problem related to updating ingredient information in a database of products, specifically focusing on how to effectively group ingredients to minimize the cost of updating product labels. The context includes considerations of statistical grouping, optimization of search processes, and practical implementation challenges.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant suggests using a graph where each label is a vertex, and edges connect labels sharing ingredients, proposing to divide groups by connected components.
  • Another participant introduces a method of "random poking with a stick," suggesting a search and return approach to find groups of ingredients based on inclusion and exclusion criteria.
  • A participant proposes a systematic division of products based on the presence or absence of ingredients, questioning the importance of the order in which ingredients are searched.
  • One participant notes that common ingredients, such as flour, could create larger groups and suggests updating these first to facilitate smaller group management thereafter.
  • There is a consideration of optimizing the search order to achieve a more even distribution of products in the resulting groups.

Areas of Agreement / Disagreement

Participants present multiple competing views on how to approach the grouping of ingredients, with no consensus reached on a single method or process. The discussion remains unresolved regarding the best strategy for grouping ingredients.

Contextual Notes

Participants express uncertainty about the efficiency of different proposed methods and the implications of ingredient commonality on the grouping process. The discussion highlights limitations in practical implementation and the need for further exploration of optimization techniques.

mxplut
First let me apologise for the improper use of 'groups'. I'm not a mathematician but I know that 'groups' means something specific.

Anyway, here is my problem. The exact circumstances of my problem are esoteric and bothersome to explain, and I don't want to distract you with details that do not matter (but I will if you ask ;-) so I am simplifying somewhat.

I look after a database that contains recipes for about 1000 products. Each product contains one or more ingredients. Some ingredients are used in many products, other ingredients are used in only a handful of products. Every product has a label that lists the ingredients. There are approximately 600 different ingredients.

I need to update the information in my database about the ingredients. For the sake of argument, let's say I need to just spell them in capital letters, but in reality I am doing something more involved. Because I have changed the spelling of my ingredients, I need to update the labels of the products that contain those ingredients.

Changing the spelling of all 600 ingredients will take a very long time, so I need to split the 600 ingredients up into smaller groups.

Updating the label of each product is expensive, so I only want to do it once per product. This means that when I divide the list of ingredients into groups, every product that contains any ingredient from a group must contain only ingredients from that group.

The question is "what process should I use to decide how to divide up my list of ingredients"?

Thanks very much for your thoughts and advice
Mx
 
If you had a graph that let each label be a vertex, and labels were connected by an edge if they shared an ingredient, then you could divide your groups by connected components. If you only have a couple of connected components, so that the groups are too large, you could look for clusters and update by cluster.
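
To make the connected-components idea concrete, here is a minimal Python sketch. It assumes the recipes can be exported as a mapping from product name to the set of ingredients on its label; the name `ingredient_groups` and that data layout are my own assumptions, not anything from the original database. Merging all ingredients that appear on the same label gives the same components as the label graph described above.

```python
from collections import defaultdict

def ingredient_groups(products):
    """Group ingredients so that no product spans two groups.

    `products` maps a product name to the set of ingredients on its label
    (an assumed layout). Returns a list of (ingredient_set, product_set)
    pairs, one per connected component.
    """
    parent = {}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Every ingredient starts in its own group.
    for ingredients in products.values():
        for ing in ingredients:
            parent.setdefault(ing, ing)

    # All ingredients on one label must end up in the same group.
    for ingredients in products.values():
        ordered = sorted(ingredients)
        for other in ordered[1:]:
            union(ordered[0], other)

    # Collect ingredients and products per component.
    groups = defaultdict(lambda: (set(), set()))
    for name, ingredients in products.items():
        if not ingredients:
            continue  # a product with no ingredients belongs to no group
        ing_set, prod_set = groups[find(next(iter(ingredients)))]
        ing_set.update(ingredients)
        prod_set.add(name)
    return list(groups.values())
```

Each returned pair can be worked on independently: update the ingredients in the first set, then relabel the products in the second set once. If a widely used ingredient ties almost everything together, this may return one very large group, which is the situation where looking for clusters would come in.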

However, if you truly only have 600 ingredients, then it seems like working it out in this way would take longer than just updating the 600 ingredients.

Anyway that's my thought. Not sure if it makes the problem any easier, or just translates it to a different problem.
 
New idea :)

Not sure what software you are using, but perhaps a little "random poking with a stick" could work out.

If you can do a "search and return" for labels that contain "any of the ingredients A, B, C,.." and NONE of the ingredients "(the remaining ingredients that are not A, B, C, or whatever)" then you might be able to "happen" upon groups of workable size, if you are lucky and strategic about your search.
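
A rough Python sketch of that kind of search, assuming the recipes are available as a mapping from product name to a set of ingredients (the function names `products_within` and `leftovers` are made up for illustration):

```python
def products_within(products, chosen):
    """Products that use at least one chosen ingredient and nothing outside `chosen`."""
    chosen = set(chosen)
    return {
        name
        for name, ingredients in products.items()
        if ingredients and ingredients <= chosen  # non-empty subset of `chosen`
    }

def leftovers(products, chosen):
    """Products that touch `chosen` but also need an ingredient outside it."""
    chosen = set(chosen)
    return {
        name
        for name, ingredients in products.items()
        if ingredients & chosen and not ingredients <= chosen
    }
```

A trial set of ingredients is a workable group when `leftovers` comes back empty. When it does not, the extra ingredients of the leftover products suggest what to add to the trial set next.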
 
Thanks for the thoughts. You lost me with the first one. Poking with a stick is something I can do though!

In principle, 'search and return' to derive groups that contain none of a set of ingredients sounds feasible. In practice it is a little tricky for an Excel jockey like me.

I had a thought which goes like this:

Take my pool of 1000 products and divide them into 2 groups as follows - <Contains ingredient 1> and <does not contain ingredient 1>
Take the group <does not contain ingredient 1> and divide it into 2 groups - <Contains ingredient 2> and <does not contain ingredient 2>
Keep going until I get to <Contains ingredient 600> and <does not contain any ingredients>.

So then I would work back up that list of 'does not contain...' groups, picking bite-size chunks to work on. But then I wonder if it makes any difference what order I search for ingredients. If it does, then I'd want to optimise the order that I search for ingredients so that I get the most even distribution of products in the 'does not contain...' groups.

Any ideas how I would prove that the search order is important and then optimise the search order?
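
One way to see that the search order matters is a small counterexample. The Python sketch below runs the repeated contains / does-not-contain split on four made-up products; the function name `split_by_order` and the sample data are assumptions for illustration only. It shows that order changes the group sizes, but it does not try to find the best order.

```python
def split_by_order(products, order):
    """Assign each product to the first ingredient in `order` that it contains.

    Returns (buckets, remaining): `buckets` maps each ingredient to the
    products peeled off at that step, and `remaining` lists the products
    that contain none of the ingredients in `order`.
    """
    buckets = {}
    remaining = dict(products)
    for ing in order:
        # Products in the current "does not contain..." pool that contain `ing`.
        hit = [name for name, ings in remaining.items() if ing in ings]
        buckets[ing] = sorted(hit)
        for name in hit:
            del remaining[name]
    return buckets, sorted(remaining)

products = {
    "bread": {"flour", "yeast"},
    "cake": {"flour", "sugar"},
    "icing": {"sugar"},
    "beer": {"yeast"},
}

flour_first, _ = split_by_order(products, ["flour", "sugar", "yeast"])
yeast_first, _ = split_by_order(products, ["yeast", "sugar", "flour"])
print({ing: len(names) for ing, names in flour_first.items()})
print({ing: len(names) for ing, names in yeast_first.items()})
```

Searching flour first gives groups of 2, 1 and 1 products; searching yeast first gives 2, 2 and 0. Same products, different split, so the order does matter.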
 
Well, certain common ingredients will create larger groups; flour, for example, might be an ingredient in 75 percent of products. In a case like that, you would want to update the flour first and then throw it out as a search parameter. You could then work with smaller groups that may or may not contain flour, because by the time you update their labels the flour will already be fixed.

It might help to take the (ten) most common ingredients and update them, and then do the grouping technique without regard to those ten ingredients.
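
As a sketch of that step in Python, assuming the recipes are a mapping from product name to a set of ingredients (the helper names `most_common_ingredients` and `without` are invented here):

```python
from collections import Counter

def most_common_ingredients(products, k):
    """Return the k ingredients that appear on the most labels."""
    counts = Counter(ing for ings in products.values() for ing in ings)
    return [ing for ing, _ in counts.most_common(k)]

def without(products, ignored):
    """Copy of `products` with the ignored ingredients stripped from every label."""
    ignored = set(ignored)
    return {name: ings - ignored for name, ings in products.items()}
```

The idea would be to update whatever `most_common_ingredients(products, 10)` returns first, then run the grouping on `without(products, those_ten)`, so the common ingredients no longer tie unrelated products together. Ties in the counts are broken arbitrarily here, so the tenth entry may vary between runs.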
 
