What is the Most Efficient Algorithm for Finding Duplicates in an Array?

  • Thread starter: ChrisVer
  • Tags: Algorithm

Discussion Overview

The discussion revolves around finding the most efficient algorithm for detecting duplicates in an array of numbers. Participants explore various methods, including brute force, sorting, hashing, and specialized data structures, while considering the implications of data structure and expected input characteristics.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant suggests a brute force approach with a time complexity of O(N^2) and questions if there are faster methods.
  • Another participant proposes that if the numbers have a known maximum, hashing could achieve O(n) time complexity in most cases.
  • Some participants mention that hashing can reduce the number of full comparisons needed, but it does not eliminate them entirely.
  • Bloom Filters are suggested as a potential method for large datasets if some error probability is acceptable, though one participant emphasizes the need for exact duplicates.
  • Discussion includes the idea of using a hash function to track duplicates efficiently, with suggestions for using unordered maps in C++.
  • One participant raises concerns about the relevance of application context in determining algorithm efficiency, while others argue that data characteristics significantly influence performance.
  • There are mentions of alternative data structures, such as suffix trees, for specific cases of duplicate detection.
  • Participants discuss the implications of data structure and expected behavior on algorithm choice, including considerations of memory and computational resources.

Areas of Agreement / Disagreement

Participants express differing views on the importance of application context in choosing an algorithm, with some arguing that it is crucial while others maintain that fast duplicate checking should be independent of application specifics. The discussion remains unresolved regarding the optimal approach, with multiple competing views presented.

Contextual Notes

Participants note that the efficiency of algorithms may depend on the expected behavior of the data, including the number of duplicates and the order of processing. There are also discussions about the trade-offs between average and maximum processing times based on available resources.

ChrisVer
Science Advisor
Hi, I was wondering (and I am somewhat unconfident about my time estimates)... if I have an object that holds several numbers (one per event): A = (A_0, A_1, A_2,...,A_{N-1}), what's the best way to find whether there are any duplicates among them?
At first I thought the simplest piece of code:
Code:
for (std::size_t i = 0; i < A.size(); i++) {
  int current = A[i];
  for (std::size_t j = i + 1; j < A.size(); j++) {
     int rest = A[j];
     if (rest == current) { std::cout << "Duplicate found"; }
  }
}
If the size of A is N as I gave above, this code takes N*(N-1)/2 comparisons to finish (so it scales as N^2). Thus, having millions of numbers in the array can be problematic.

Another way would be to sort the array A and then use binary search... A sort of the form:
Code:
std::sort( A.begin() , A.end() );
takes N*log(N), and the binary search takes log(N)... so the search for a duplicate takes time (N+1)*log(N).

Are there any faster ways? (Like applying some binary function of
Code:
std::sort( A.begin() , A.end() , [](int a, int b) { return a==b; } );
? (this avoids the binary search, as then I will only need the length of the resulting array)
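As an aside on the sort-based route: after sorting, any duplicates sit next to each other, so no binary search is needed at all; a single linear pass (e.g. `std::adjacent_find`) suffices. A minimal sketch (the function name is my own, not from the thread):

```cpp
#include <algorithm>
#include <vector>

// Returns true if v contains at least one duplicate value.
// Sorting is O(N log N); the adjacent scan afterwards is O(N).
bool has_duplicate_sorted(std::vector<int> v) {  // take a copy so the caller's order is untouched
    std::sort(v.begin(), v.end());
    // After sorting, equal values are adjacent; adjacent_find spots the first such pair.
    return std::adjacent_find(v.begin(), v.end()) != v.end();
}
```

For example, `has_duplicate_sorted({3, 1, 4, 1, 5})` returns true, while `has_duplicate_sorted({3, 1, 4})` returns false.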

One problem which I haven't mentioned yet is that I don't have the numbers in an array format (I'll have to create the array, which adds +N more time).
 
If your numbers have some known maximum, there can be other methods that get their runtime from this maximum. If not (or if this maximum is too large), I don't see how you could beat the sorting-like algorithms in the worst (!) case. A suitable hash algorithm and storing which hash you saw already will be O(n) in most cases.
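The "store which hash you saw already" idea maps directly onto `std::unordered_set`: `insert()` reports whether the value was already present, so the scan stops at the first duplicate. A sketch (helper name is mine); expected O(N), degrading only with pathological hash collisions:

```cpp
#include <unordered_set>
#include <vector>

// Expected O(N): insert() returns {iterator, bool}; bool == false
// means the value was already in the set, i.e. a duplicate.
bool has_duplicate_hash(const std::vector<int>& v) {
    std::unordered_set<int> seen;
    seen.reserve(v.size());  // reserve up front to avoid rehashing mid-scan
    for (int x : v) {
        if (!seen.insert(x).second) return true;  // second duplicate occurrence detected here
    }
    return false;
}
```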
ChrisVer said:
One problem which I haven't yet mentioned is that I don't have the array in an array format (I'll have to create it, so I will get +N more time to do that).
Then you can directly do something like heapsort, and you'll find the duplicate as soon as you try to add it.
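The build-as-you-read idea can be sketched with `std::set` (a balanced tree rather than a heap, but the same principle: the duplicate shows up the moment you try to add it, with no separate search pass). Function name is my own:

```cpp
#include <set>
#include <vector>

// O(N log N): each insert is O(log N) and reports on the spot
// whether the value was already present.
bool duplicate_while_building(const std::vector<int>& input) {
    std::set<int> tree;
    for (int x : input) {
        if (!tree.insert(x).second) return true;  // duplicate found as soon as we try to add it
    }
    return false;
}
```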
 
Likes: QuantumQuest
In general: hashing.
- - - -
If you have a massive amount of data:
If you can tolerate some small error probability of false positives, Bloom Filters can be a slick way to go.
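A toy Bloom filter, to make the trade-off concrete (sizes and the second hash are my own choices, not from the thread): membership answers of "maybe" can be false positives, but "no" is exact, so in a duplicate scan a hit only tells you a full comparison is still needed.

```cpp
#include <bitset>
#include <cstddef>
#include <functional>

// Toy Bloom filter: 2^20 bits, two derived hash functions.
// maybe_contains() == false is definitive; true may be a false positive.
class BloomFilter {
    std::bitset<1 << 20> bits_;
    static std::size_t h1(int x) { return std::hash<int>{}(x); }
    static std::size_t h2(int x) { return h1(x) * 0x9E3779B97F4A7C15ull; }  // cheap second hash
public:
    void add(int x) {
        bits_[h1(x) % bits_.size()] = true;
        bits_[h2(x) % bits_.size()] = true;
    }
    bool maybe_contains(int x) const {
        return bits_[h1(x) % bits_.size()] && bits_[h2(x) % bits_.size()];
    }
};
```

The false-positive rate is controlled by the bit-array size and the number of hash functions, so it can be tuned down but never to zero.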

If things are highly and nicely structured, you may be able to use bit arrays btw.
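If the values are integers in a known range [0, M), the bit-array idea is a direct O(N) check with one bit per possible value (the bound M is an assumption here, not something given in the thread):

```cpp
#include <vector>

// O(N) time, max_value bits of memory.
// Assumes 0 <= x < max_value for every element x.
bool has_duplicate_bitarray(const std::vector<int>& v, int max_value) {
    std::vector<bool> seen(max_value, false);  // packed bit array
    for (int x : v) {
        if (seen[x]) return true;  // bit already set: second occurrence
        seen[x] = true;
    }
    return false;
}
```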
 
Likes: FactChecker
my data/events are structured within TTree (TBranches) formats...
https://root.cern.ch/doc/master/classTTree.html
and yes, it's quite massive, but I don't have the luxury of small errors (I am actually searching for such a possibility)
 
A built-in language hash function will take care of collisions that would otherwise cause errors. I think this is an unordered map in C++. As each new number is obtained, just test if the hash entry of it (as a key) already exists. If so, it is a duplicate. Then set the hash entry of it to something, so that any future occurrences will be flagged as a duplicate. I have often found it useful to set the hash entry to the accumulated position indices of the occurrences of that key. That makes it easy to locate the duplicates in a large list.
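That bookkeeping (each key accumulating the position indices of its occurrences) might look like this with `std::unordered_map`; the function name is my own:

```cpp
#include <unordered_map>
#include <vector>

// Map each value to every index where it occurs; any entry holding
// more than one index is a duplicate, and you know exactly where.
std::unordered_map<int, std::vector<std::size_t>>
index_positions(const std::vector<int>& v) {
    std::unordered_map<int, std::vector<std::size_t>> positions;
    for (std::size_t i = 0; i < v.size(); ++i) {
        positions[v[i]].push_back(i);  // operator[] creates the entry on first sight
    }
    return positions;
}
```

For instance, after `auto p = index_positions({5, 3, 5});`, the entry `p[5]` holds the indices {0, 2}, locating both occurrences.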
 
ChrisVer said:
but I don't have the luxury of small errors

Chris, one of the things I have noticed is that you keep asking questions until your problem is exactly solved. You don't take the idea of any of the previous messages and develop them yourself. If you want a career in the sciences, you need to get past this.

If you have a positive, false or otherwise, it means you need to do the full comparison. The hash table is a device that dramatically reduces - but does not eliminate - the number of times you need to do the full comparison. Furthermore, as the algorithm designer, you have control over the size of your hash table, and thus the degree of false positives: you can make that rate arbitrarily small.

Also, getting the exact application from you is like pulling teeth. Why so cagey? If you're trying to remove duplicate events (my best guess based on what you have said), why not say so? And if you're trying to do something else, why not say that? In the duplicate-event case, the fact that your data is already almost sorted makes a big difference in how long it takes to run. If you leave it to us to guess your application, you run the risk of getting an answer which may be optimal in the general case, but not in your specific case.
 
Likes: Tom.G
Vanadium 50 said:
Chris, one of the things I have noticed is that you keep asking questions until your problem is exactly solved. You don't take the idea of any of the previous messages and develop them yourself. If you want a career in the sciences, you need to get past this.
Thanks, I will keep that in mind, although - in my defence - it doesn't apply to this case [I don't see which case it applies to, as you seem to imply it's a common thing coming from me]. Dismissing an answer is part of understanding the "problem".
Also I am not asking for "solutions" - I already have them at hand and posted them in the OP. I am asking for alternative algorithms [if anyone knows of one] that may be able to work faster. (I've been pointed to hashes).

Vanadium 50 said:
If you leave it to us to guess your application, you run the risk of getting an answer which may be optimal in the general case, but not your specific case.
Not cagey at all, the problem I was trying to address is irrelevant to the application. Fast checking for duplicates shouldn't be application-dependent.
 
ChrisVer said:
Fast checking for duplicates shouldn't be application-dependent.
It is (as indicated by multiple users in this thread). What is fast and what is not depends on your data.
 
Although the speed may depend on the data, trying to modify your algorithm to the data may be a complication that you don't want to get into. In general, doing a sort just to determine duplicates can take an unnecessarily long time. A hash function is significantly faster.
 
Likes: ChrisVer
  • #11
ChrisVer said:
Not cagey at all, the problem I was trying to address is irrelevant to the application. Fast checking for duplicates shouldn't be application-dependent.
It's very very relevant.
There are potential O(n) solutions - depending on how the data is expected to behave; what computing resources and memory you have available; and how the data is stored.
 
  • #12
.Scott said:
depending on how the data is expected to behave
It seems like I don't get the way you imagine data.
the data can be: "red","green","blue". Or it can be "Marie","Nick","Donald" . Or it may be 1,2,3 ...
So supposing your data sequence is ["Nick", "Maria", "Josh" , "Nick" , "Donald"] , is there any kind of "behavior"? Similarly for raw numbers...

.Scott said:
what computing resources and memory you have available
This is somewhat interesting, but I suppose it only tells you how demanding you can let your algorithm be, not how to optimize it? I mean, the more resources someone has available, the more freedom they have to choose not to go optimally.

.Scott said:
how the data is stored.
yup, I said arrays (or vectors or whatever you want to call them)... but if it was a txt file, it wouldn't make much of a difference, because you would eventually have to load them into an array, right?
 
  • #13
ChrisVer said:
It seems like I don't get the way you imagine data.
the data can be: "red","green","blue". Or it can be "Marie","Nick","Donald" . Or it may be 1,2,3 ...
So supposing your data sequence is ["Nick", "Maria", "Josh" , "Nick" , "Donald"] , is there any kind of "behavior"? Similarly for raw numbers...
Things that are important are: how many collisions (duplicates) are expected; whether there is a preferred order or timing in the further processing of the duplicates, or whether they will simply be enumerated or discarded; and whether we are interested in a low average processing time or a low maximum processing time - and what the trade-off is.

ChrisVer said:
This is somewhat interesting, but I suppose it only tells you how hard you can be with your algorithms and not optimizing it? I mean the more resources someone has available, the more freedom they have in choosing not to go optimally.
That couldn't be more true. You are talking to someone who started on a Honeywell Series 200 computer - with 16 Kbytes of memory, three 9Kc tape drives, and a card sorter/punch. What is critical here is where your data sets will end up. If you keep the size to 100 Mbytes, will it all end up in unpaged memory? If so, then something based on a bin sort (O(n)) may be very practical.
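A bin-sort-style duplicate scan, assuming (as above) that everything fits in memory and the values lie in a known range [0, M): one O(N) counting pass, then one O(M) pass to read off the repeated values. The helper name and range parameter are my own:

```cpp
#include <cstddef>
#include <vector>

// Count occurrences of each value in [0, max_value), then report
// every value seen more than once. O(N + max_value) time overall.
std::vector<int> duplicate_values(const std::vector<int>& v, int max_value) {
    std::vector<std::size_t> count(max_value, 0);
    for (int x : v) ++count[x];  // bin-sort counting pass; assumes 0 <= x < max_value
    std::vector<int> dups;
    for (int val = 0; val < max_value; ++val)
        if (count[val] > 1) dups.push_back(val);
    return dups;
}
```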

ChrisVer said:
yup, I said arrays (or vectors or whatever you want to call it)... but if it was a txt file, it wouldn't make much of a difference because you would eventually have to add them in an array right?
If it's in a file, then the best you can do is read it in and process it at the same time - so that the elapsed time is the same as simply reading through the file. But will you have to reexamine some parts of that file on a second pass?
If it's already in memory, then we need to look at the structure - we may find opportunities there - especially if you control the code that generates that memory image. And if it is in memory, is it paged?
 
Likes: mfb
