# Help! - Combinatorics cause heavy computing

1. May 28, 2008

### einarbmag

I have n comparable sets of data, with several thousand numbered values in each set. A certain number, call it S_m, can be calculated by choosing m of these sets. The value of S_m depends on which sets I choose, so it can be thought of as a function of m variables that can only take discrete values, no two the same (corresponding to the sets chosen).

My problem is, for a given m, to find the combination(s) that give the max/min of S_m.
It can be done by calculating S_m for every combination, but that quickly amounts to a huge number of calculations, the largest of course at m = n/2. I therefore probably need some algorithm for reaching the right combination.
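To put rough numbers on that blow-up, here is a quick sketch using n = 50 (a size mentioned later in the thread); the binomial coefficients peak at m = n/2, and all values of m together total 2^n combinations:

```python
# How fast the number of combinations grows: C(n, m) peaks at m = n/2,
# and summed over all m it equals 2**n.
from math import comb

n = 50
print(comb(n, n // 2))                         # combinations at m = n/2
print(sum(comb(n, m) for m in range(n + 1)))   # over all m: 2**n
```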

Any ideas?

2. May 28, 2008

### Hurkyl

Staff Emeritus
With the information you've given us, any (correct) algorithm requires you to try every combination. To do better than that, you will need to use knowledge of the specific function you're trying to maximize.

3. May 28, 2008

### einarbmag

Ok, the sets are columns of values, and when we choose m sets we put the columns together in a matrix.

Now, S_m = a/b, where a is the largest row-sum and b is the sum of the maxima of each column.
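As a sketch, that definition can be written directly in code (the function name is mine; it assumes the m chosen columns are given as equal-length lists):

```python
# S_m for a chosen set of columns:
# (largest row sum) / (sum of the maxima of each column)
def s_m(columns):
    rows = list(zip(*columns))              # rebuild the rows of the matrix
    max_row_sum = max(sum(r) for r in rows)
    sum_col_max = sum(max(c) for c in columns)
    return max_row_sum / sum_col_max
```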

4. May 28, 2008

### CRGreathouse

How is the row sum defined? $$S_2(\{1,2\},\{3,4\})$$ can involve

$$\left[\begin{array}{cc} 1&3\\ 2&4\end{array}\right]$$ or $$\left[\begin{array}{cc} 2&3\\ 1&4\end{array}\right]$$

which have maximum row sums 6 and 5, respectively. Since sets (by definition!) don't have order, how do you choose?

5. May 28, 2008

### einarbmag

I guess I was a little mathematically imprecise (not a mathematician, almost a physicist).
"Lists" is probably more precise than "sets": the values are in a time sequence and therefore ordered.
By row sum I just mean the sum of the elements in a row, and the maximum row sum is just the largest of those sums.

Physical interpretation of the problem: values are recorded at n sites in a time sequence.
Choosing m of the n sites, I need to find out which combination gives me the highest/lowest ratio between (the highest peak of the sum of values) and (the sum of (the highest peak at each site)).

6. May 28, 2008

### CRGreathouse

In that case m = n/2 is no longer your worst case; your worst case is m = n (or n - 1). This nearly squares the number of cases!

Can you give a small worked out example so I'm sure I properly understand? For example, with sets {1, 2} and {2, 3}, the matrices for $$S_2$$ could be

$$\left[\begin{array}{cc} 1&2\\ 2&3 \end{array}\right]$$ $$\left[\begin{array}{cc} 1&3\\ 2&2 \end{array}\right]$$ $$\left[\begin{array}{cc} 2&2\\ 1&3 \end{array}\right]$$ $$\left[\begin{array}{cc} 2&3\\ 1&2 \end{array}\right]$$

and the matrices for $$S_1$$

[1 2] [1 3] [2 2] [2 3]

right?

7. May 28, 2008

### einarbmag

I think you are making this more complicated than it is:

I have a table of values; each column (n columns) represents a recording site, and each row holds all values recorded at a specific time. I pick m of these columns and calculate the S_m described above for those. Which combination of m columns gives me the highest/lowest S_m?

Example:

$$\left[\begin{array}{ccc} 1&2&2\\ 2&4&3\\ 3&1&1 \end{array}\right]$$

Let's calculate S_2 for all 3 combinations:
Column 1 and 2:
Maximum row sum is 6, sum of maximum of columns is 7, so S_2=6/7.

columns 1 and 3:
Maximum row sum is 5, sum of maximum of columns is 6, so S_2=5/6.

Columns 2 and 3:
Maximum row sum is 7, sum of maxima of columns is 7, so S_2=1.
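Those three values can be double-checked with a short script (a sketch; the helper `s_m` is my naming, following the definition of the largest row sum over the sum of column maxima):

```python
from itertools import combinations

def s_m(columns):
    # S_m = (largest row sum) / (sum of the maxima of each column)
    rows = list(zip(*columns))
    return max(sum(r) for r in rows) / sum(max(c) for c in columns)

# the three columns of the example table
cols = [[1, 2, 3], [2, 4, 1], [2, 3, 1]]
for i, j in combinations(range(3), 2):
    print(f"columns {i + 1} and {j + 1}: S_2 = {s_m([cols[i], cols[j]])}")
```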

This should clear it up. Btw, how do I show LaTeX?

8. May 28, 2008

### CRGreathouse

(LaTeX: write displayed math as [ tex]...[/tex] and inline math as [ itex]...[/itex], without the spaces.)

Okay, that's yet another case (glad I asked for an example!). The bad case there is m = n/3, give or take. I'll have to think about this. What restrictions are there on the values? All positive? Integers or reals? Will they tend to be close in value?

9. May 28, 2008

### einarbmag

All right:
the values are positive reals, and high values tend to occur close together in time across all columns (i.e., in rows near each other). Values can differ by 1-2 orders of magnitude.

10. May 30, 2008

### einarbmag

It's a tough one, but I really have to find some method other than brute force. Taking only 10 thousand combinations for each m, with n = 50, calculating $S_m$ for all those combinations and all m up to 49 takes a few hours. And that doesn't even come close to solving my problem of finding the max/min of $S_m$ for each m. This is partly caused by the vast number of rows that need to be summed, about 35 thousand.

11. May 30, 2008

### CRGreathouse

At the moment you're talking about a problem that (with n = 50) would take in the neighborhood of 1e51 years, or worst-case (n = 10,000) 1e16122 years. The age of the universe, for comparison, is estimated around 1e10 to 3e10 years.

Would a good estimate be worthwhile? Simulated annealing might be a viable approach if that's the case.
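For what it's worth, a minimal simulated-annealing sketch for this subset-selection problem might look like the following. Everything here (cooling schedule, iteration count, move set) is a placeholder to be tuned, `s_m` just restates the definition from earlier in the thread, and it assumes m < n:

```python
import math
import random

def s_m(columns):
    # S_m = (largest row sum) / (sum of the maxima of each column)
    rows = list(zip(*columns))
    return max(sum(r) for r in rows) / sum(max(c) for c in columns)

def anneal_max(columns, m, iters=10000, t0=0.1, seed=0):
    """Search for an m-subset of columns maximizing s_m (assumes m < n)."""
    rng = random.Random(seed)
    n = len(columns)
    chosen = rng.sample(range(n), m)
    best = current = s_m([columns[i] for i in chosen])
    best_set = list(chosen)
    for k in range(iters):
        t = t0 * (1 - k / iters) + 1e-9          # linear cooling schedule
        # propose swapping one chosen column for an unused one
        out = rng.randrange(m)
        unused = [i for i in range(n) if i not in chosen]
        cand = chosen[:]
        cand[out] = rng.choice(unused)
        val = s_m([columns[i] for i in cand])
        # accept improvements always, worse moves with Boltzmann probability
        if val >= current or rng.random() < math.exp((val - current) / t):
            chosen, current = cand, val
            if current > best:
                best, best_set = current, list(chosen)
    return best, best_set
```

For the minimum, negate the objective (or flip the acceptance test). On the small 3-column example from earlier in the thread this immediately recovers columns 2 and 3 with S_2 = 1.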

12. May 30, 2008

### einarbmag

I started looking around yesterday for search algorithms and stumbled upon simulated annealing, and it does look slightly promising.

I might try that, but I suspect that to truly solve this in an elegant way I need to come up with something based on how $$S_m$$ is calculated.

13. May 30, 2008

### CRGreathouse

Depending on how layered the samples are (correlation across rows), here's a search approach that may work. For each row, find a suitable average (median or truncated mean). Next, compute for each column its deviation from that average (sum of absolute or squared errors). Starting from the columns with the least deviation, assign columns alternately to each of two groups.

When both groups are populated with m samples, sum the values and compare. With a certain probability, either swap columns between the lists or swap one out of a list for one of the unused columns. When choosing an unused column, assign high probability to columns with low deviation and low probability to those with high deviation; slowly lower the probability of choosing an unused column at all.

While testing, keep the best 100 results (or whatever value works) and iterate until the value of S_m is as close to 1 as is reasonable.

To find the low value of S_m, instead favor the columns with high variance.

A better system would include the direction of the deviation from the average for each row; this may be difficult if there are many rows.

In any case this method should be tested and refined with 10,000 columns and the results compared with the known best results. When it produces results close to those expected, run it on the full system.
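The first two steps above (row averages, then per-column deviations) could be sketched like this; the function names are mine, and the squared-error choice is one of the two options mentioned:

```python
from statistics import median

def column_deviations(columns):
    # deviation of each column from the per-row medians (sum of squared errors)
    rows = list(zip(*columns))
    row_med = [median(r) for r in rows]
    return [sum((v - m_) ** 2 for v, m_ in zip(col, row_med))
            for col in columns]

def seed_by_deviation(columns, m):
    # the m columns closest to the row medians, as a starting point
    dev = column_deviations(columns)
    order = sorted(range(len(columns)), key=lambda i: dev[i])
    return order[:m]
```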
