Real-world logic problem with statistics and groups

  1. Apr 18, 2013 #1
    First let me apologise for the improper use of 'groups'. I'm not a mathematician but I know that 'groups' means something specific.

    Anyway, here is my problem. The exact circumstances of my problem are esoteric and bothersome to explain, and I don't want to distract you with details that do not matter (but I will if you ask ;-) so I am simplifying somewhat.

    I look after a database that contains recipes for about 1000 products. Each product contains one or more ingredients. Some ingredients are used in many products, other ingredients are used in only a handful of products. Every product has a label that lists the ingredients. There are approximately 600 different ingredients.

    I need to update the information in my database about the ingredients. For the sake of argument, let's say I just need to spell them in capital letters, but in reality I am doing something more involved. Because I have changed the spelling of my ingredients, I need to update the labels of the products that contain those ingredients.

    Changing the spelling of all 600 ingredients will take a very long time, so I need to split the 600 ingredients up into smaller groups.

    Updating the label of each product is expensive, so I only want to do it once per product. This means that when I divide the list of ingredients into groups, all of the products that contain these ingredients must contain only these ingredients.

    The question is "what process should I use to decide how to divide up my list of ingredients"?

    Thanks very much for your thoughts and advice
  3. Apr 18, 2013 #2
    If you had a graph in which each label is a vertex, and two labels are connected by an edge if they share an ingredient, then you could divide your groups by connected components... If you only have a couple of connected components, so that the groups are too large, you could look for clusters and update by cluster.

    However, if you truly only have 600 ingredients, then it seems like working it out in this way would take longer than just updating the 600 products.

    Anyway that's my thought. Not sure if it makes the problem any easier, or just translates it to a different problem.
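
    A minimal sketch of the connected-components idea, assuming the recipes can be exported as a mapping from product name to a list of ingredients. The `recipes` data and every function name here are made up for illustration. Instead of building the label graph explicitly, it merges ingredients that appear together on any label, which should give the same components:

```python
from collections import defaultdict

def ingredient_groups(recipes):
    """Split ingredients into groups so that each product draws from one group only."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Ingredients that share a label must end up in the same group.
    for ingredients in recipes.values():
        ingredients = list(ingredients)
        for other in ingredients[1:]:
            union(ingredients[0], other)

    groups = defaultdict(lambda: {"ingredients": set(), "products": set()})
    for product, ingredients in recipes.items():
        for ing in ingredients:
            root = find(ing)
            groups[root]["ingredients"].add(ing)
            groups[root]["products"].add(product)
    return list(groups.values())

# Hypothetical sample data.
recipes = {
    "bread": ["flour", "yeast", "salt"],
    "cake": ["flour", "sugar", "egg"],
    "lemonade": ["lemon", "water"],
    "soda": ["water", "syrup"],
}
groups = ingredient_groups(recipes)
for g in groups:
    print(sorted(g["ingredients"]), "->", sorted(g["products"]))
```

    Each group it returns can be updated on its own, since no product outside the group uses any of its ingredients. If one group swallows most of the database, this alone won't help and some form of clustering would still be needed.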
  4. Apr 18, 2013 #3
    New idea :)

    Not sure what software you are using, but perhaps a little "random poking with a stick" could work out.

    If you can do a "search and return" for labels that contain "any of the ingredients A, B, C,.." and NONE of the ingredients "(the remaining ingredients that are not A, B, C, or whatever)" then you might be able to "happen" upon groups of workable size, if you are lucky and strategic about your search.
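
    For what it's worth, the "any of these, none of the rest" search could look roughly like this outside the database. This is only a sketch: it assumes the recipes are available as a dict of product name to ingredient list, and `is_clean_batch` and the other names are hypothetical helpers, not features of any particular software:

```python
def products_touching(recipes, candidate):
    """Products that contain at least one of the candidate ingredients."""
    candidate = set(candidate)
    return {p for p, ings in recipes.items() if candidate & set(ings)}

def products_using_only(recipes, candidate):
    """Products whose ingredients all come from the candidate set."""
    candidate = set(candidate)
    return {p for p, ings in recipes.items() if ings and set(ings) <= candidate}

def is_clean_batch(recipes, candidate):
    """True if every product touched by the batch uses nothing outside the batch."""
    return products_touching(recipes, candidate) == products_using_only(recipes, candidate)

# Hypothetical sample data.
recipes = {
    "bread": ["flour", "yeast", "salt"],
    "cake": ["flour", "sugar", "egg"],
    "lemonade": ["lemon", "water"],
    "soda": ["water", "syrup"],
}

print(is_clean_batch(recipes, {"lemon", "water", "syrup"}))  # workable batch
print(is_clean_batch(recipes, {"flour", "yeast", "salt"}))   # cake also uses flour
```

    A guessed batch is workable when the two product sets match. When they don't, the difference between them shows which labels drag in ingredients from outside the batch.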
  5. Apr 18, 2013 #4
    Thanks for the thoughts. You lost me with the first one. Poking with a stick is something I can do though!

    In principle, 'search and return' to derive groups that contain none of a set of ingredients sounds feasible. In practice, it is a little tricky for an Excel jockey like me.

    I had a thought which goes like this:

    Take my pool of 1000 products and divide them into 2 groups as follows - <Contains ingredient 1> and <does not contain ingredient 1>
    Take the group <does not contain ingredient 1> and divide it into 2 groups - <Contains ingredient 2> and <does not contain ingredient 2>
    Keep going until I get to <Contains ingredient 600> and <does not contain any ingredients>.

    So then I would work back up that list of 'does not contain...' groups, picking bite-size chunks to work on. But then I wonder if it makes any difference what order I search for ingredients. If it does, then I'd want to optimise the order that I search for ingredients so that I get the most even distribution of products in the 'does not contain...' groups.

    Any ideas how I would prove that the search order is important and then optimise the search order?
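
    One way to show that the order matters is a small counterexample. The sketch below, with made-up data and a hypothetical `split_by_order` helper, runs the successive "contains / does not contain" split for two different ingredient orders and compares the group sizes. It only demonstrates the effect and does not try to find the best order:

```python
def split_by_order(recipes, order):
    """Peel off, for each ingredient in turn, the remaining products that contain it."""
    remaining = dict(recipes)
    groups = []
    for ing in order:
        hit = sorted(p for p, ings in remaining.items() if ing in ings)
        if hit:
            groups.append((ing, hit))
        for p in hit:
            del remaining[p]
    return groups

# Hypothetical sample data.
recipes = {
    "bread": ["flour", "yeast", "salt"],
    "cake": ["flour", "sugar", "egg"],
    "biscuit": ["flour", "sugar"],
    "lemonade": ["lemon", "water"],
}

order_a = ["flour", "yeast", "sugar", "lemon"]  # common ingredient first
order_b = ["yeast", "sugar", "flour", "lemon"]  # rare ingredients first

sizes_a = [len(hit) for _, hit in split_by_order(recipes, order_a)]
sizes_b = [len(hit) for _, hit in split_by_order(recipes, order_b)]
print(sizes_a)  # [3, 1]
print(sizes_b)  # [1, 2, 1]
```

    Searching for the common ingredient first puts three of the four products into one group, while starting with the rarer ingredients spreads them out more evenly. So the order does change the distribution, at least on this toy data.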
  6. Apr 19, 2013 #5
    Well, certain common ingredients will create larger groups; flour, for example, might be an ingredient in 75 percent of products. In a case like that, you would want to update the flour first and then throw it out as a search parameter. Then you could work with smaller groups that may or may not contain flour, but when you update their labels, the flour will already be fixed.

    It might help to take the (ten) most common ingredients and update them, and then do the grouping technique without regard to those ten ingredients.
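
    Here is a rough sketch of that idea, again assuming the recipes are a dict of product name to ingredient list. The data and the helper names are invented for illustration. It counts how many products use each ingredient, sets the most common ones aside, and then groups whatever is left by shared ingredients:

```python
from collections import Counter

def most_common_ingredients(recipes, k):
    """The k ingredients that appear in the largest number of products."""
    counts = Counter(ing for ings in recipes.values() for ing in set(ings))
    return [ing for ing, _ in counts.most_common(k)]

def groups_ignoring(recipes, ignored):
    """Group products by shared ingredients, disregarding the ignored ones."""
    ignored = set(ignored)
    groups = []  # list of (ingredient set, product set), kept pairwise disjoint
    for product, ings in recipes.items():
        ings = set(ings) - ignored
        merged_ings, merged_products = set(ings), {product}
        rest = []
        for g_ings, g_products in groups:
            if g_ings & ings:
                # This product links to an existing group, so absorb it.
                merged_ings |= g_ings
                merged_products |= g_products
            else:
                rest.append((g_ings, g_products))
        rest.append((merged_ings, merged_products))
        groups = rest
    return groups

# Hypothetical sample data.
recipes = {
    "bread": ["flour", "yeast", "salt"],
    "cake": ["flour", "sugar", "egg"],
    "pasta": ["flour", "egg"],
    "cracker": ["flour", "salt"],
}

common = most_common_ingredients(recipes, 1)
before = groups_ignoring(recipes, [])
after = groups_ignoring(recipes, common)
print(common, len(before), len(after))
```

    On this toy data, flour ties everything into a single group. Once it is updated first and ignored, the products fall into two smaller groups. A product made only of the ignored ingredients ends up in a group of its own with no ingredients left to update.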