Is there any algorithm to find coincidences between 2 lists?


Discussion Overview

The discussion revolves around finding coincidences between two lists containing disordered items. Participants explore various algorithms and approaches to efficiently identify common elements, considering the challenges posed by large data sets.

Discussion Character

  • Technical explanation
  • Mathematical reasoning
  • Debate/contested

Main Points Raised

  • One participant suggests sorting the lists to simplify finding coincidences but expresses concern about the inefficiency of this method for large lists.
  • Another participant proposes using a hash set to store items from the first list and check for membership with items from the second list, noting that this might be slower than sorting.
  • A different viewpoint emphasizes the efficiency of sorting algorithms like quicksort or mergesort, suggesting that they generally perform better than direct comparisons for large lists.
  • Several participants provide code snippets in C++ and Python to illustrate the use of hash sets and the intersection method for finding common items.
  • One participant mentions the importance of implementing the solution clearly before optimizing for performance, highlighting the potential complexities of cache effects and Big-O analysis.
  • Another participant raises concerns about the performance of different approaches, suggesting that testing multiple methods may be necessary to determine the most efficient solution.

Areas of Agreement / Disagreement

Participants express a range of opinions on the best approach to finding coincidences, with no consensus reached on a single optimal method. Some favor hash sets for their speed, while others advocate for sorting algorithms, indicating a debate over the efficiency of different techniques.

Contextual Notes

Participants note that the performance of algorithms can vary significantly based on the size of the lists and the specific characteristics of the data. There are also discussions about the potential pitfalls of using unordered sets and the importance of testing various implementations.

Who May Find This Useful

This discussion may be useful for programmers and computer scientists interested in algorithm design, particularly those dealing with data structures and performance optimization in C++ or Python.

ORF
Hello

I need to find coincidences between items in 2 lists. The lists have unordered items, so the first idea I had was to sort the items and then finding the coincidences would be easy. The main problem is the large number of items in the lists. I was using doubly-linked lists to sort the items.

Another idea is to compare each item of one list with all the items in the other list.

Both methods seemed horribly slow to me, so my question is: is there a better algorithm to find coincidences between two lists?

Thanks in advance :)

Greetings
 
What language are you using?
 
You could put all the items of the first list into a set (e.g. a hash set) and then check if any of the items of the second list are part of that set.
But that is going to be slower than the sorting approach. How long are the lists? And do you just want to find all the elements that are contained in both lists regardless of their frequency?
 
Although I cannot speak in the abstract about any kind of list and the processes on them, good sorting algorithms like quicksort or even mergesort have good running times. Of course, an amortized analysis must be done with the specific data in the lists taken into account, but if the lists are long enough, sorting algorithms with good running times will generally do better than other approaches. Direct comparison between the two lists will be very inefficient (an O(n²) sort of thing).
 
If you could tell us your platform and language(s), we could give you an actual solution. For example, linux/awk:
Code:
awk 'FILENAME=="file1" {arr[$0]++; next}
     FILENAME=="file2" {if ($0 in arr) print}' file1 file2 > duplicates
FWIW this uses associative arrays (hashes).
 
DrZoidberg said:
You could put all the items of the first list into a set (e.g. a hash set) and then check if any of the items of the second list are part of that set.
Using Python you can put the members of the two lists into two sets, and then find the intersection of the two sets using the intersection() method.
 
DrZoidberg said:
You could put all the items of the first list into a set (e.g. a hash set) and then check if any of the items of the second list are part of that set.
But that is going to be slower than the sorting approach.
I believe that the set approach is almost always going to be faster. Assuming you use a hash-table-based set, the amortized time complexity would be O(n). Sorting alone is O(n log n).

In C++ the thing to use is std::unordered_set, or std::unordered_map if you want to count occurrences. In Java it's HashSet or HashMap; in Python, set and dict.

C++:
// C++ example; element type assumed int for illustration

std::unordered_set<int> s;
std::unordered_set<int> common;

// Put every item of the first list into the hash set
for( const auto & item : list1 ) {
    s.insert( item );
}
// Any item of the second list found in the set is common to both
for( const auto & item : list2 ) {
    if( s.find( item ) != s.end() ) {
        common.insert( item );
    }
}
 
Hello

Wow, I didn't expect so many answers, thank you! :D

I can use C/C++ (I'm still learning C++11), Python or Visual Basic, but I think C++ is the fastest.

@DrZoidberg: that idea works fine with short lists, but with the long ones... One other thing: items will not be repeated within the same list.

@Jarvis323: is the "find" function fast enough?

Thank you for your time :)

Greetings!
 
First implement in the clearest, most maintainable way, and only then address performance as needed. In C++, the clearest expression of this problem using standard algorithms may be with std::set_intersection (http://en.cppreference.com/w/cpp/algorithm/set_intersection). Note that despite the name it is not limited to std::sets.

If you need to optimize, test. Today, cache effects make it much more difficult than it once was to predict optimal solutions based only on Big-O. In your testing of alternatives, favour compact representations of the data, to maximize benefit from cacheing. For example, try using std::vector rather than std::list.
 
  • #10
C++:
std::set<int> vec1, vec2, common;

// Prepare 2 sets
for (int i = 0; i < 10; i++)
{
    vec1.insert(i);
    if (i % 2 == 0)
        vec2.insert(i);
}

// Make the common set: insert().second is false when the
// element was already present in vec1
for (std::set<int>::iterator it = vec2.begin(); it != vec2.end(); it++)
{
    if (!vec1.insert(*it).second)
    {
        common.insert(*it);
    }
}
 
  • #11
Hello

@Integrand: I did not know about the "algorithm" library, thank you! :D
@Silicon Waffle: on the page
http://en.cppreference.com/w/cpp/algorithm/set_intersection (just for accuracy, I think the proper function for looking only for coincidences between 2 lists would be set_intersection), it's explained that the complexity scales with N1+N2. With your code, I think you pass through all the items in both lists, so the complexity scales with N1*N2. In my case, N1 and N2 will be billions of items.

Thank you for the information! :D

Greetings
 
  • #12
ORF said:
I need to find coincidences between items in 2 lists. The lists have disordered items, so the first idea I had was to sort the items in the correct order and then finding the coincidences would be easy. The main problem is the big amount of items in the lists. I was using "double-linked lists" in order to sort the items.
Based on subsequent answers, you're using C++. In C++, it's generally better to use a std::vector rather than a std::list. An exception would be if you are inserting or deleting anywhere but the end. But if you're just building the container up one element at a time, use a vector.
Jarvis323 said:
C++:
std::unordered_set<int> s;
for( const auto & item : list1 ) {
    s.insert( item );
}
It's probably faster here to use std::unordered_set<int> s(list1.begin(), list1.end()). The implementation might take advantage of the size of the container to set the bucket count.

C++:
for( const auto & item : list2 ) {
    if( s.find( item ) != s.end() ) {
        common.insert( item );
    }
}
Personal preference, I would use if (s.count(item) != 0) { ... }. (Even better would be if there was a std::unordered_set::contains member function that returns a boolean.)
Integrand said:
First implement in the clearest, most maintainable way, and only then address performance as needed. In C++, the clearest expression of this problem using standard algorithms may be with std::set_intersection (http://en.cppreference.com/w/cpp/algorithm/set_intersection). Note that despite the name it is not limited to std::sets.
It is however limited to objects that are sorted, and sorting is an O(N*log(N)) algorithm.

Today, cache effects make it much more difficult than it once was to predict optimal solutions based only on Big-O. In your testing of alternatives, favour compact representations of the data, to maximize benefit from cacheing. For example, try using std::vector rather than std::list.
This is key. It might well be faster to use a fast O(N*log(N)) algorithm (e.g., sorting a pair of std::vectors in place, and then using std::set_intersection to find the intersection) than to use a slow O(N) algorithm (e.g., constructing a std::unordered_set from one of the containers, and then using a loop to find the intersection). There's also the risk of encountering worst case performance with std::unordered_set. Constructing an unordered set from a container is nominally linear in the size of the container, but it can be quadratic.

If performance matters (and from other posts by the OP, it appears performance does matter), the best thing to do is to implement multiple approaches and see which wins. But the first thing to do is to test whether performance does matter. Implement it simply and test. If there aren't any performance issues, you're done.
Silicon Waffle said:
C++:
// Make the common set
for (std::set<int>::iterator it = vec2.begin(); it != vec2.end(); it++)
{
    if (!vec1.insert(*it).second)
    {
        common.insert(*it);
    }
}
There's no need to use if (!vec1.insert(*it).second), which modifies vec1 as a side effect. Either use Jarvis323's if (s.find(item) != s.end()), or my suggested alternative, if (s.count(item) != 0).
ORF said:
on the page
http://en.cppreference.com/w/cpp/algorithm/set_intersection (just for accuracy, I think the proper function for looking only for coincidences between 2 lists would be set_intersection), it's explained that the complexity scales with N1+N2. With your code, I think you pass through all the items in both lists, so the complexity scales with N1*N2. In my case, N1 and N2 will be billions of items.
You are forgetting that sorting is O(N*log(N)), and your lists are unsorted. You'll have to sort them before you can use std::set_intersection, so the cost is dominated by the two sorts: O(N1*log(N1) + N2*log(N2)). Compare that to the nominally linear cost of constructing a std::unordered_set from a container, and the nominally O(1) cost of looking up an element in an unordered set. The whole algorithm using an unordered set is nominally O(N1+N2). But ... that's assuming nominal behavior. There is the possibility of running into worst case performance, in which case a hash-set-based algorithm does indeed become O(N1*N2).
 
  • #13
Hi

I've tried different containers, and the best option (at least for me) is a map (C++).

Thank you for your help :)

Greetings!
 
