What is the Most Efficient Algorithm for Finding Duplicates in an Array?

  • #1
ChrisVer
Hi, I was wondering (and I am not very confident about my timing estimates)... if I have an object that holds several numbers (one per event), [itex]A = (A_0, A_1, A_2, ..., A_{N-1})[/itex], what is the best way to find whether there are duplicates among them?
At first I thought the simplest piece of code:
Code:
for (std::size_t i = 0; i < A.size(); i++) {
  int current = A[i];
  for (std::size_t j = i + 1; j < A.size(); j++) {
    int rest = A[j];
    if (rest == current) { std::cout << "Duplicate found\n"; }
  }
}
If the size of A is [itex]N[/itex] as I gave above, this code performs N*(N-1)/2 comparisons (so it scales as [itex]N^2[/itex]). Thus, having millions of numbers in the array can be problematic.

Another way would be to sort the array A and then scan it... A sort of the form:
Code:
std::sort( A.begin() , A.end() );
takes N*log(N), and since equal elements end up next to each other after sorting, a single linear pass finds any duplicate in N steps... so the whole check takes time [itex] N\log(N) + N [/itex], i.e. it scales as [itex] N\log(N) [/itex].
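
For concreteness, a minimal sketch of that sort-based check, assuming plain int entries (just an illustration, not necessarily the final form of my data):
Code:
#include <algorithm>
#include <vector>

// Sort-based duplicate check: after sorting, equal values are adjacent,
// so std::adjacent_find spots a duplicate in one linear pass.
// Cost: ~N*log(N) for the sort plus N for the scan.
bool has_duplicate_sorted(std::vector<int> A)   // taken by value so the original order is untouched
{
    std::sort(A.begin(), A.end());
    return std::adjacent_find(A.begin(), A.end()) != A.end();
}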

Are there any faster ways? (Like applying some binary predicate, e.g.
Code:
std::sort( A.begin() , A.end() );
auto newEnd = std::unique( A.begin() , A.end() );
? That avoids the extra scan, as then I only need to compare the length of the de-duplicated range, newEnd - A.begin(), with the original size.)

One problem I haven't mentioned yet is that the data isn't in array format to begin with (I'll have to build the array first, which adds another N to the time).
 
  • #2
If your numbers have some known maximum, there are other methods whose runtime depends on that maximum. If not (or if the maximum is too large), I don't see how you could beat the sorting-like algorithms in the worst (!) case. A suitable hash function, plus storing which values you have already seen, will be O(n) in most cases.
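
As a minimal sketch of such a hash-based check (assuming int entries; std::unordered_set is one way to realize it):
Code:
#include <unordered_set>
#include <vector>

// Hash-based duplicate check, expected O(N): insert() returns a pair whose
// .second is false when the value was already in the set.
bool has_duplicate_hashed(const std::vector<int>& A)
{
    std::unordered_set<int> seen;
    seen.reserve(A.size());              // pre-allocate buckets to avoid rehashing
    for (int v : A)
        if (!seen.insert(v).second)
            return true;                 // second occurrence found
    return false;
}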
ChrisVer said:
One problem which I haven't yet mentioned is that I don't have the array in an array format (I'll have to create it, so I will get +N more time to do that).
Then you can directly do the sorting incrementally as you build the structure (e.g. insert the values into a std::set), and you'll find the duplicate as soon as you try to insert it.
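
A tiny sketch of that "detect while building" idea, assuming int values: feed each value into an ordered container as it is produced and stop at the first insertion that fails.
Code:
#include <set>

// Incremental check while the data is being generated: O(log N) per value,
// and the duplicate is reported at the moment it appears.
bool insert_or_report(std::set<int>& seen, int value)
{
    return !seen.insert(value).second;   // true -> this value is a duplicate
}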
 
  • #3
In general: hashing.
- - - -
If you have a massive amount of data:
If you can tolerate some small error probability of false positives, Bloom Filters can be a slick way to go.
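
A minimal Bloom-filter sketch, assuming string-like entries (the bit count and the two hash functions below are arbitrary illustrative choices, not tuned recommendations): it can report a false positive but never a false negative.
Code:
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Bloom-filter membership test: two bits per inserted value, derived from
// std::hash with different salts. "Possibly seen" answers may be wrong
// (false positives); "never seen" answers are always correct.
class BloomFilter {
public:
    explicit BloomFilter(std::size_t bit_count) : bits_(bit_count, false) {}

    // Returns true if the value was possibly seen before, then records it.
    bool test_and_insert(const std::string& value)
    {
        const std::size_t a = std::hash<std::string>{}(value) % bits_.size();
        const std::size_t b = std::hash<std::string>{}(value + "#salt") % bits_.size();
        const bool maybe_seen = bits_[a] && bits_[b];
        bits_[a] = true;
        bits_[b] = true;
        return maybe_seen;
    }

private:
    std::vector<bool> bits_;
};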

If things are highly and nicely structured, you may be able to use bit arrays btw.
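
And a bit-array sketch for the "nicely structured" case, assuming non-negative integer values with a known upper bound (that bound is the assumption that makes this O(n) in time and O(max) in memory):
Code:
#include <cstddef>
#include <vector>

// Bit-array duplicate check: one flag per possible value.
// Assumes every value v satisfies 0 <= v <= max_value.
bool has_duplicate_bits(const std::vector<int>& A, int max_value)
{
    std::vector<bool> seen(static_cast<std::size_t>(max_value) + 1, false);
    for (int v : A) {
        if (seen[static_cast<std::size_t>(v)])
            return true;                 // value already flagged
        seen[static_cast<std::size_t>(v)] = true;
    }
    return false;
}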
 
  • #4
My data/events are stored in TTree (TBranch) format...
https://root.cern.ch/doc/master/classTTree.html
and yes, it's quite massive but I don't have the luxury of small errors (I am actually searching for such a possibility)
 
  • #5
A built-in hash container will take care of the collisions that would otherwise cause errors; in C++ this is an unordered map. As each new number is obtained, just test whether an entry for it (as a key) already exists. If so, it is a duplicate. Then set its entry to something, so that any future occurrences will also be flagged as duplicates. I have often found it useful to set the entry to the accumulated position indices of the occurrences of that key; that makes it easy to locate the duplicates in a large list.
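
A minimal sketch of that position-accumulating map, assuming int keys: each key maps to the list of indices where it occurred, so the duplicates are easy to locate afterwards.
Code:
#include <cstddef>
#include <unordered_map>
#include <vector>

// Map each value to the positions where it occurs; any entry with more than
// one position is a duplicate, and the indices say exactly where.
std::unordered_map<int, std::vector<std::size_t>>
collect_positions(const std::vector<int>& A)
{
    std::unordered_map<int, std::vector<std::size_t>> positions;
    for (std::size_t i = 0; i < A.size(); ++i)
        positions[A[i]].push_back(i);    // accumulate occurrence indices
    return positions;
}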
 
  • #6
ChrisVer said:
but I don't have the luxury of small errors

Chris, one of the things I have noticed is that you keep asking questions until your problem is exactly solved. You don't take the ideas from any of the previous messages and develop them yourself. If you want a career in the sciences, you need to get past this.

If you have a positive, false or otherwise, it means you need to do the full comparison. The hash table is a device that dramatically reduces - but does not eliminate - the number of times you need to do the full comparison. Furthermore, as the algorithm designer, you have control over the size of your hash table, and thus the degree of false positives: you can make that rate arbitrarily small.
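
In C++ terms, that sizing control is just a couple of calls up front; a sketch, where expected_count is an illustrative name for whatever estimate of N is available:
Code:
#include <cstddef>
#include <unordered_set>

// Keep the expected number of entries per bucket low so a hash hit rarely
// needs more than one full comparison.
void presize(std::unordered_set<long long>& seen, std::size_t expected_count)
{
    seen.max_load_factor(0.5f);          // average of at most 0.5 entries per bucket
    seen.reserve(expected_count);        // allocate enough buckets once, up front
}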

Also, getting the exact application from you is like pulling teeth. Why so cagey? If you're trying to remove duplicate events (my best guess based on what you have said), why not say so? And if you're trying to do something else, why not say that? In the duplicate-event case, the fact that your data is already almost sorted makes a big difference in how long it takes to run. If you leave it to us to guess your application, you run the risk of getting an answer which may be optimal in the general case, but not your specific case.
 
  • #7
Vanadium 50 said:
Chris, one of the things I have noticed is that you keep asking questions until your problem is exactly solved. You don't take the ideas from any of the previous messages and develop them yourself. If you want a career in the sciences, you need to get past this.
Thanks, I will keep that in mind, although - in my defence - it doesn't apply to this case [I don't see which case it applies to, as you seem to imply it's a common thing coming from me]. Dismissing an answer is part of understanding the "problem".
Also, I am not asking for "solutions" - I already have some at hand and posted them in the OP. I am asking for alternative algorithms [if anyone knows of one] that may be able to work faster. (I've been pointed to hashes.)

Vanadium 50 said:
If you leave it to us to guess your application, you run the risk of getting an answer which may be optimal in the general case, but not your specific case.
I'm not being cagey at all; the problem I was trying to address is independent of the application. Fast checking for duplicates shouldn't be application-dependent.
 
  • #8
ChrisVer said:
Fast checking for duplicates shouldn't be application-dependent.
It is (as indicated by multiple users in this thread). What is fast and what is not depends on your data.
 
  • #9
Although the speed may depend on the data, tailoring your algorithm to the data may be a complication that you don't want to get into. In general, doing a full sort just to detect duplicates can take unnecessarily long. A hash-based approach is significantly faster.
 
  • #11
ChrisVer said:
I'm not being cagey at all; the problem I was trying to address is independent of the application. Fast checking for duplicates shouldn't be application-dependent.
It's very, very relevant.
There are potential O(n) solutions - depending on how the data is expected to behave; what computing resources and memory you have available; and how the data is stored.
 
  • #12
.Scott said:
depending on how the data is expected to behave
It seems I don't quite get what you mean by how the data behaves.
The data can be "red", "green", "blue". Or it can be "Marie", "Nick", "Donald". Or it may be 1, 2, 3...
So suppose the data sequence is ["Nick", "Maria", "Josh", "Nick", "Donald"]: is there any kind of "behavior" there? Similarly for raw numbers...

.Scott said:
what computing resources and memory you have available
This is somewhat interesting, but I suppose it only tells you how careless you can afford to be with your algorithm, not how to optimize it? I mean, the more resources someone has available, the more freedom they have to not go for the optimal solution.

.Scott said:
how the data is stored.
Yup, I said arrays (or vectors or whatever you want to call them)... but if it were a txt file, it wouldn't make much of a difference, because you would eventually have to load the values into an array, right?
 
  • #13
ChrisVer said:
It seems I don't quite get what you mean by how the data behaves.
The data can be "red", "green", "blue". Or it can be "Marie", "Nick", "Donald". Or it may be 1, 2, 3...
So suppose the data sequence is ["Nick", "Maria", "Josh", "Nick", "Donald"]: is there any kind of "behavior" there? Similarly for raw numbers...
The things that matter are: how many collisions (duplicates) are expected; whether there is a preferred order or timing for the further processing of the duplicates, or whether they will simply be enumerated or discarded; and whether we are interested in a low average processing time or a low maximum processing time - and what the trade-off between them is.

ChrisVer said:
This is somewhat interesting, but I suppose it only tells you how careless you can afford to be with your algorithm, not how to optimize it? I mean, the more resources someone has available, the more freedom they have to not go for the optimal solution.
That couldn't be more true. You are talking to someone who started on a Honeywell Series 200 computer - with 16 Kbytes of memory, three 9Kc tape drives, and a card sort/punch. What is critical here is where your data sets will end up. If you keep the size to 100 Mbytes, will it all end up in unpaged memory? If so, then something based on a bin sort (O(n)) may be very practical.
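
A minimal bin-sort-style sketch, assuming 32-bit unsigned keys and a simple high-bits binning scheme (both are assumptions for illustration): scatter the values into small bins, then check each bin locally. With bins that stay small, this behaves like O(n).
Code:
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bin-based duplicate check: the high 16 bits pick one of 65536 bins,
// each bin is tiny on average and is checked with a local sort + scan.
bool has_duplicate_binned(const std::vector<std::uint32_t>& data)
{
    constexpr std::size_t kBins = 1u << 16;
    std::vector<std::vector<std::uint32_t>> bins(kBins);
    for (std::uint32_t v : data)
        bins[v >> 16].push_back(v);

    for (auto& bin : bins) {
        std::sort(bin.begin(), bin.end());
        if (std::adjacent_find(bin.begin(), bin.end()) != bin.end())
            return true;
    }
    return false;
}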

ChrisVer said:
Yup, I said arrays (or vectors or whatever you want to call them)... but if it were a txt file, it wouldn't make much of a difference, because you would eventually have to load the values into an array, right?
If it's in a file, then the best you can do is read it in and process it at the same time - so that the elapsed time is the same as simply reading through the file. But will you have to reexamine some parts of that file on a second pass?
If it's already in memory, then we need to look at the structure - we may find opportunities there - especially if you control the code that generates that memory image. And if it is in memory, is it paged?
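
As a sketch of the single-pass file case (the filename and the whitespace-separated integer format are assumptions for illustration): the duplicate test rides along with the read, so the elapsed time stays close to the raw read time.
Code:
#include <cstddef>
#include <fstream>
#include <iostream>
#include <unordered_set>

int main()
{
    std::ifstream in("values.txt");      // hypothetical input file
    std::unordered_set<long long> seen;
    long long value;
    std::size_t position = 0;
    while (in >> value) {
        if (!seen.insert(value).second)
            std::cout << "Duplicate " << value << " at position " << position << '\n';
        ++position;
    }
    return 0;
}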
 

What is a "Finding duplicates algorithm"?

A "Finding duplicates algorithm" is a set of instructions or calculations used to identify and remove duplicate elements from a dataset. It is commonly used in data cleaning and organization processes in various fields such as computer science, mathematics, and statistics.

How does a "Finding duplicates algorithm" work?

The specific steps of a "Finding duplicates algorithm" may vary, but generally, it involves comparing each element in a dataset to all other elements and identifying duplicates based on predefined criteria. These criteria can include an exact match, similarity threshold, or a combination of factors.

What are the benefits of using a "Finding duplicates algorithm"?

Using a "Finding duplicates algorithm" can save time and resources by automating the process of identifying and removing duplicate elements from a dataset. It also helps to improve the accuracy and reliability of data, as duplicate entries can lead to errors and inconsistencies.

What are some common applications of "Finding duplicates algorithms"?

"Finding duplicates algorithms" are commonly used in various fields, including database management, data mining, image processing, and text analysis. They are also widely used in industries such as finance, healthcare, and e-commerce to improve data quality and efficiency.

How can I choose the best "Finding duplicates algorithm" for my data?

The best "Finding duplicates algorithm" will depend on the type of data you are working with and the specific criteria for identifying duplicates. It is important to consider factors such as the size of the dataset, the complexity of the data, and the desired level of accuracy when selecting an algorithm.
