Finding duplicates algorithm

  • #1
ChrisVer

Main Question or Discussion Point

Hi, I was wondering (and I am not very confident about my time estimates)... if I have an object that holds several numbers (one associated with each event), [itex]A = (A_0, A_1, A_2,...,A_{N-1})[/itex], what's the best way to find out whether there are duplicates among them?
At first I thought the simplest piece of code:
Code:
for (std::size_t i = 0; i < A.size(); ++i) {
   int current = A[i];
   for (std::size_t j = i + 1; j < A.size(); ++j) {
      int rest = A[j];
      if (rest == current) { std::cout << "Duplicate found\n"; }
   }
}
If the size of A is [itex]N[/itex] as I gave above, this code performs N*(N-1)/2 comparisons (so it scales as [itex]N^2[/itex]). Thus, having millions of numbers in the array can be problematic.

Another way would be to sort the array A and then do a binary search... A sort of the form:
Code:
std::sort( A.begin() , A.end() );
takes N*log(N) and the binary search takes log(N)... so the search for a duplicate takes time [itex] (N+1)*log(N) [/itex].
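For concreteness, here is a minimal sketch of the sort-based check (here I just scan adjacent elements after the sort rather than doing a binary search), assuming A is a std::vector<int>:
Code:
#include <algorithm>
#include <iostream>
#include <vector>

// Sort, then look for two equal neighbours:
// O(N log N) for the sort, O(N) for the scan.
bool has_duplicate_sorted(std::vector<int> A)   // taken by value so the caller's order is untouched
{
    std::sort(A.begin(), A.end());
    return std::adjacent_find(A.begin(), A.end()) != A.end();
}

int main()
{
    std::vector<int> A = {4, 7, 1, 7, 3};
    if (has_duplicate_sorted(A)) std::cout << "Duplicate found\n";
}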

Are there any faster ways? (Like applying some binary function, as in
Code:
std::sort( A.begin() , A.end() , [](int a, int b) { return a == b; } );
? That would avoid the binary search, as I would then only need the length of the resulting array.)

One problem which I haven't mentioned yet is that I don't have the data in an array format to begin with (I'll have to create the array, so that adds another N to the time).
 

Answers and Replies

  • #2
If your numbers have some known maximum, there can be other methods that get their runtime from this maximum. If not (or if this maximum is too large), I don't see how you could beat the sorting-like algorithms in the worst (!) case. A suitable hash algorithm, combined with storing which hashes you have already seen, will be O(n) in most cases.
One problem which I haven't mentioned yet is that I don't have the data in an array format to begin with (I'll have to create the array, so that adds another N to the time).
Then you can directly do something like heapsort, and you'll find the duplicate as soon as you try to add it.
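A rough sketch of the hash-based version (my own illustration, assuming the values are ints), checking each value as it is read so the duplicate shows up the moment it appears:
Code:
#include <iostream>
#include <unordered_set>
#include <vector>

int main()
{
    std::vector<int> data = {4, 7, 1, 7, 3};   // stands in for the event loop

    std::unordered_set<int> seen;              // hash set: O(1) insert/lookup on average
    for (int x : data) {
        // insert() reports whether the value was already present
        if (!seen.insert(x).second) {
            std::cout << "Duplicate found: " << x << '\n';
        }
    }
}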
 
  • #3
StoneTemplePython
In general: hashing.
- - - -
If you have a massive amount of data and can tolerate some small probability of false positives, Bloom filters can be a slick way to go.

If things are highly and nicely structured, you may be able to use bit arrays btw.
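A toy sketch of the bit-array idea, assuming the values are non-negative integers below some known maximum:
Code:
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    const std::size_t max_value = 1000000;         // assumed known upper bound on the values
    std::vector<int> data = {4, 7, 1, 7, 3};

    std::vector<bool> seen(max_value + 1, false);  // one flag per possible value
    for (int x : data) {
        if (seen[x]) std::cout << "Duplicate found: " << x << '\n';
        seen[x] = true;
    }
}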
 
  • #4
ChrisVer
my data/events are structured within TTree (TBranches) formats...
https://root.cern.ch/doc/master/classTTree.html
and yes, it's quite massive, but I don't have the luxury of small errors (I am actually searching for exactly such a possibility)
 
  • #5
FactChecker
A built-in hash container will take care of the collisions that would otherwise cause errors; in C++ I think this is the unordered map. As each new number is obtained, just test whether a hash entry with it as the key already exists. If so, it is a duplicate. Then set the hash entry for it to something, so that any future occurrences will be flagged as duplicates. I have often found it useful to set the hash entry to the accumulated position indices of the occurrences of that key; that makes it easy to locate the duplicates in a large list.
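Something along these lines (a sketch of my own, assuming integer keys):
Code:
#include <cstddef>
#include <iostream>
#include <unordered_map>
#include <vector>

int main()
{
    std::vector<int> data = {4, 7, 1, 7, 3, 7};

    // key -> positions at which that key has occurred so far
    std::unordered_map<int, std::vector<std::size_t>> positions;
    for (std::size_t i = 0; i < data.size(); ++i) {
        auto& where = positions[data[i]];          // creates an empty entry on first sight
        if (!where.empty()) {
            std::cout << "Duplicate " << data[i] << " at index " << i
                      << " (first seen at index " << where.front() << ")\n";
        }
        where.push_back(i);
    }
}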
 
  • #6
Vanadium 50
but I don't have the luxury of small errors
Chris, one of the things I have noticed is that you keep asking questions until your problem is exactly solved. You don't take the idea of any of the previous messages and develop them yourself. If you want a career in the sciences, you need to get past this.

If you have a positive, false or otherwise, it means you need to do the full comparison. The hash table is a device that dramatically reduces - but does not eliminate - the number of times you need to do the full comparison. Furthermore, as the algorithm designer, you have control over the size of your hash table, and thus the degree of false positives: you can make that rate arbitrarily small.
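To make that concrete, here is a sketch of my own (with an arbitrary table size) showing how the hash only narrows down which values need the full comparison:
Code:
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> data = {4, 7, 1, 7, 3};

    const std::size_t table_size = 1 << 20;        // you choose this: a bigger table means fewer collisions
    std::vector<std::vector<int>> buckets(table_size);

    for (int x : data) {
        auto& bucket = buckets[static_cast<std::size_t>(x) % table_size];
        bool duplicate = false;
        for (int y : bucket) {                     // full comparison only against values in the same bucket
            if (y == x) { duplicate = true; break; }
        }
        if (duplicate) std::cout << "Duplicate found: " << x << '\n';
        else           bucket.push_back(x);
    }
}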

Also, getting the exact application from you is like pulling teeth. Why so cagey? If you're trying to remove duplicate events (my best guess based on what you have said), why not say so? And if you're trying to do something else, why not say that? In the duplicate-event case, the fact that your data is already almost sorted makes a big difference in how long it takes to run. If you leave it to us to guess your application, you run the risk of getting an answer which may be optimal in the general case, but not your specific case.
 
  • #7
ChrisVer
Chris, one of the things I have noticed is that you keep asking questions until your problem is exactly solved. You don't take the idea of any of the previous messages and develop them yourself. If you want a career in the sciences, you need to get past this.
Thanks, I will keep that in mind, although (in my defence) it doesn't apply to this case [I don't see which case it applies to, as you seem to imply it's a common thing coming from me]. Dismissing an answer is part of understanding the "problem".
Also, I am not asking for "solutions" - I already have some at hand and posted them in the OP. I am asking for alternative algorithms [if anyone knows of one] that may work faster. (I've been pointed to hashes.)

If you leave it to us to guess your application, you run the risk of getting an answer which may be optimal in the general case, but not your specific case.
Not cagey at all; the problem I was trying to address is independent of the application. Fast checking for duplicates shouldn't be application-dependent.
 
  • #8
Fast checking for duplicates shouldn't be application-dependent.
It is (as indicated by multiple users in this thread). What is fast and what is not depends on your data.
 
  • #9
FactChecker
Although the speed may depend on the data, tailoring your algorithm to the data may be a complication that you don't want to get into. In general, doing a sort just to determine duplicates can take an unnecessarily long time. A hash-based lookup is significantly faster.
 
  • #11
.Scott
Not cagey at all; the problem I was trying to address is independent of the application. Fast checking for duplicates shouldn't be application-dependent.
It's very very relevant.
There are potential O(n) solutions - depending on how the data is expected to behave; what computing resources and memory you have available; and how the data is stored.
 
  • #12
ChrisVer
depending on how the data is expected to behave
It seems like I don't get the way you imagine data.
the data can be: "red","green","blue". Or it can be "Marie","Nick","Donald" . Or it may be 1,2,3 ...
So supposing your data sequence is ["Nick", "Maria", "Josh" , "Nick" , "Donald"] , is there any kind of "behavior"? Similarly for raw numbers...

what computing resources and memory you have available
This is somewhat interesting, but I suppose it only tells you how heavy your algorithm can afford to be, not how to optimize it? I mean, the more resources someone has available, the more freedom they have to choose a non-optimal approach.

how the data is stored.
yup, I said arrays (or vectors or whatever you want to call them)... but if it was a txt file, it wouldn't make much of a difference, because you would eventually have to load the values into an array, right?
 
  • #13
.Scott
It seems like I don't get the way you imagine data.
the data can be: "red","green","blue". Or it can be "Marie","Nick","Donald" . Or it may be 1,2,3 ...
So supposing your data sequence is ["Nick", "Maria", "Josh" , "Nick" , "Donald"] , is there any kind of "behavior"? Similarly for raw numbers...
The important things are: how many collisions (duplicates) are expected; whether there is a preferred order or timing in the further processing of the duplicates, or whether they will simply be enumerated or discarded; and whether we are interested in a low average processing time or a low maximum processing time - and what the trade-off is.

This is somewhat interesting, but I suppose it only tells you how heavy your algorithm can afford to be, not how to optimize it? I mean, the more resources someone has available, the more freedom they have to choose a non-optimal approach.
That couldn't be more true. You are talking to someone who started on a Honeywell Series 200 computer - with 16 Kbytes of memory, three 9Kc tape drives, and a card sorter/punch. What is critical here is where your data sets will end up. If you keep the size to 100 Mbytes, will it all end up in unpaged memory? If so, then something based on a bin sort (O(n)) may be very practical.

yup, I said arrays (or vectors or whatever you want to call them)... but if it was a txt file, it wouldn't make much of a difference, because you would eventually have to load the values into an array, right?
If it's in a file, then the best you can do is read it in and process it at the same time - so that the elapsed time is the same as simply reading through the file. But will you have to reexamine some parts of that file on a second pass?
If it's already in memory, then we need to look at the structure - we may find opportunities there - especially if you control the code that generates that memory image. And if it is in memory, is it paged?
 
