Finding duplicate entries in a list in Python

A common problem in computing is to find out if a list contains duplicate entries.

It’s also a good subject to explore various ways to solve a problem. In this case, we will see the performance difference between a trivial solution to this problem and an improved solution.

Find duplicates in a Python list

The trivial way to solve this problem is to scan each element of the list against every other element in the list. This will undoubtedly return the correct answer, and will work in reasonable timeframes for small lists, but it will very quickly slow down as the size of the list increases.

def check_duplicate_trivial(items):
  for idx in range(len(items)):
      for inner in range(len(items)):
          if inner == idx:
              continue  # do not compare to itself
          if items[inner] == items[idx]:
              return True
  return False

This is a loop within a loop — the outer loop is the index of the item we are comparing, and all the other items are indexed using the inner loop. You can see that this trivial implementation is really bad for large lists – it squares the number of items comparisons.

We can slightly improve the method versus completely scanning the list. If the function finds a duplicate, then it ends via returning at that point. For the majority of cases this will be significantly faster, and at worst it will still compare every element to every other element.

Improved implementation: Check an existing list

So how can we improve upon this, and fix the issue of the number of comparisons being the square of the number of items?

An improvement on this is to keep a second list of unique items already seen, and then check that list to see if the item already exists there.

def check_duplicate_improved(items):
  already_seen = {}  
  for item in items:
      if item in already_seen:
          return True
      already_seen[item] = 0
  return False

This means we only scan the main list once. We do scan the unique items list for each entry in the main list, but even in a worst case this greatly reduces the number of comparisons.

Fastest implementation: Use a Set lookup

Those two examples let us check for duplicates. The improved algorithm is much faster than the trivial algorithm, but it’s still fairly slow with a lot of comparisons. It will still take significant time to complete if given a large list.

Does the Pythonic system have something built for this specific purpose, so that we can avoid all of these comparisons?

Yes, we can use what is called a Set.

A Set is backed using a hash table. This computes a hash value for the object being added and uses that as a key to store the object when it is added. Objects can have the same hash value, and in larger sets it is common to use the hash algorithm to balance the objects over the available hashes to keep performance consistent.

The hash can be considered a pointer or direct reference into the underlying data structure, which makes lookup very fast — generally a Set lookup is a single operation. This is what makes it a great choice to turn searching items problems into constant time problems.

def check_duplicate_set(items):
  hash_bucket = set()
  for item in items:
      if item in hash_bucket:
          return True
  return False

As you see in the code above, we check to see if the item is in the Set before we add it, and if it is we return We have now reduced the check for duplicates to a single scan of the original items.

This function can also be reduced to a single line. To do so, create a set from the original items (sets are guaranteed to have only unique entries) and compare the size of the two:

def check_duplicate_distinct(items):
  return len(items) != len(set(items))

This code is both compact and efficient. In Python, for most situations where you want to detect duplicates, the Set is the best method to use for performance.

App Architecture
Software Quality
Cloud Computing