Removing Duplicate Elements from a JavaScript Array (part 1)

One of the features we wanted to add to the company website was the ability to tag and filter posts so that these posts can then be displayed on their own index pages. That meant we needed to read through all the YAML frontmatter, collate the tags into a collection, then deduplicate the collection so we could create an index page per tag.

We set out to do just that, until we hit a snag: what's the best way to deduplicate a JavaScript array (since the website was using Next.js)?

You would have expected that JavaScript, being a mature language used in almost all modern websites, would have a built-in method for removing duplicate elements from an array.

For example, Ruby's arrays have a built-in method uniq that does what you'd expect:

array = [
  "scrum",
  "scrum",
  "scrum",
  "engineering",
  "culture",
  "culture",
  "remote-work",
]
array.uniq
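# => ["scrum", "engineering", "culture", "remote-work"]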

There's no such thing in JavaScript land, so we set out towards the vast plains of the Internet in search of whatever wisdom we could find.

We eventually settled on one of them, but we realized there might be some value in exploring the variety of solutions we'd collected. We asked ourselves: how do they work, and how do they fare in terms of performance?

So we took a few of those solutions that we felt best represented a certain concept, and ran benchmarks on them.

const iterations = 1000000;
const array = [
  "scrum",
  "scrum",
  "scrum",
  "engineering",
  "culture",
  "culture",
  "remote-work",
];

const methods = {
  usingSet: (arr) => [...new Set(arr)],

  usingFilter: (arr) => arr.filter((e, i, a) => a.indexOf(e) === i),

  usingFilterIncludes: (arr) =>
    arr.filter(function (e) {
      return !this.includes(e) && this.push(e);
    }, []),

  usingObjectKeys: (arr) =>
    Object.keys(arr.reduce((acc, cur) => (acc[cur] = cur) && acc, {})),

  usingMagic: (arr, t = {}) => arr.filter((e) => !(t[e] = e in t)),
};

const performanceObj = Object.entries(methods).reduce((acc, [key, value]) => {
  const start = performance.now();
  for (let i = 0; i < iterations; i++) value(array);
  acc[key] = `${performance.now() - start} ms`;
  return acc;
}, {});

performanceObj;

We actually have two more examples for deduplicating JavaScript arrays, but they either require an external library or are currently considered experimental technology that needs a polyfill, so we won't be digging too deep into them.

// using underscore.js (advantage: they take care of optimizing the algorithm)
_.uniq(array);

// groupBy and keys, experimental and needs polyfill
Object.keys(array.groupBy((e) => e));

The benchmarks fluctuate depending on the resources your browser is currently using, but the ranking mostly remains the same. For reference, these are the results that we got:

Object {
  "usingFilter": "73 ms",
  "usingFilterIncludes": "100.89999997615814 ms",
  "usingMagic": "256.5 ms",
  "usingObjectKeys": "143.60000002384186 ms",
  "usingSet": "130.29999995231628 ms",
}

Let's look at them one by one, in order of benchmark performance.

Using filter() and indexOf()

const usingFilter = (arr) => arr.filter((e, i, a) => a.indexOf(e) === i);

We start off with the fastest of the bunch; it's also sweet and succinct (albeit requiring some thought to see how it works).

One uncommon thing here would be the usage of all three arguments in the filter function: the current element, the current index, and the array being iterated on.

The filter function finds the index of e in the array a and checks if the current index i is the same. If so, that means that this is the first time that the current e was encountered because the current index and the location of e in the array are the same.

If they are not the same, this check will evaluate to false because indexOf will return the index of the earliest occurrence of e and the current index i would not be the same as that index.
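To make that concrete, here's a quick trace over a small array:

// usingFilter(["scrum", "culture", "scrum"]):
// i = 0, e = "scrum":   a.indexOf("scrum")   === 0 → keep (first occurrence)
// i = 1, e = "culture": a.indexOf("culture") === 1 → keep (first occurrence)
// i = 2, e = "scrum":   a.indexOf("scrum")   === 0, not 2 → drop (duplicate)
// result: ["scrum", "culture"]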

Although the filter function effectively does a nested pass over the array to find the index of each element (at worst O(n^2), when all elements are already unique), it should still be fast enough for simple use cases.

Using filter() and includes()

const usingFilterIncludes = (arr) =>
  arr.filter(function (e) {
    return !this.includes(e) && this.push(e);
  }, []);

While similar to the first one, this one uses some this-chicanery to avoid having to declare a temporary variable to hold the comparison list.

Array.prototype.filter() actually takes two arguments: the callback function (which can be declared elsewhere or written inline) and an optional thisArg. The thisArg is the value to be used as this when the callback is executed, which only works when the callback is a regular function rather than an arrow function.

That means within the scope of the function, this would refer to [] since that was passed in as the second parameter to filter().

The callback function then uses includes() to check whether e is already present in that array; if it isn't, includes() returns false, the negation turns it into true, and the && goes on to push the element into the array (push returns the new length, which is truthy, so filter keeps the element).

Duplicate elements, on the other hand, make includes() return true, which is negated to false; the && then short-circuits, the push never happens, and the duplicate is filtered out.
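For comparison, here's what the same approach looks like with an explicit temporary variable instead of the thisArg trick (a sketch of our own, not one of the benchmarked methods):

const usingFilterWithSeen = (arr) => {
  const seen = []; // elements we've already encountered
  return arr.filter((e) => !seen.includes(e) && seen.push(e));
};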

It's only slightly slower than the indexOf() version, and depending on who you ask it may be easier to understand. The only unusual thing here is the thisArg, which can be surprising to developers who are used to arrow functions and never have to worry about what this references.

Using Set

const usingSet = (arr) => [...new Set(arr)];

We like this one a lot due to its simplicity and elegance. You simply create a new Set object from the array, then use the spread syntax to create a new array from the elements of the Set. The MDN docs even use this exact example of deduplicating an array in their sample usage documentation.

It's almost like having unique elements is its sole reason for living (it is, mostly).

Why would you need to create an array from a Set when a Set should be expected to behave much like an array, you might ask? If you look at the documentation for Set, you'll notice that the array methods you'd most expect to use, like map() or reduce(), aren't present (you can still use forEach(), though).

You also don't get array-like behaviors such as retrieving an element by its index using [], or creating "holes" in the sequence by directly assigning a value to a non-sequential index.
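A quick illustration:

const tags = new Set(["scrum", "scrum", "culture"]);
tags.size;                      // 2: duplicates are dropped on insertion
// tags.map((t) => t.length);   // TypeError: tags.map is not a function
tags[0];                        // undefined: no index access
[...tags].map((t) => t.length); // [5, 7]: spread into an array first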

This was what we ended up finally using in our implementation.

Using Object Keys

const usingObjectKeys = (arr) =>
  Object.keys(arr.reduce((acc, cur) => (acc[cur] = cur) && acc, {}));

There are two tricks being (ab)used here:

  • JavaScript Object property names (aka keys) don't have duplicates
  • Array.prototype.reduce() takes two arguments: the reducer function, and an optional initial value

The reducer function then creates a property in the accumulator acc and returns the accumulator. As expected in reducer functions, the returned accumulator is made available to the next iteration.

The trick here is that if the property already exists, then no new property is created and the assignment simply happens to a property that is already there.

We then call Object.keys() to get the property names of the accumulator object in order to get back a deduplicated array.
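Concretely, here's what the accumulator ends up looking like for a small input:

// usingObjectKeys(["scrum", "scrum", "culture"]) builds up:
// { scrum: "scrum", culture: "culture" }
// Object.keys then returns ["scrum", "culture"]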

While using an Object's properties as a makeshift deduplication data structure was unavoidable in past versions of JavaScript, we now have constructs like Set or Map that are specialized for this kind of operation. In fact, the MDN documentation hints at Map as a replacement for using Object as a hash-like or dictionary-like structure.
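As a rough sketch (our own, not one of the benchmarked methods), the same idea built on a Map could look like this:

// Map keys are unique too, and set() returns the Map, so it chains in reduce()
const usingMapKeys = (arr) =>
  [...arr.reduce((acc, cur) => acc.set(cur, cur), new Map()).keys()];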

The downside of using a Map instead of an Object is similar to that of using a Set: you don't have familiar methods like map or reduce, and, probably worse, there's room for confusion between setting Object properties and setting Map keys via []= (after all, a Map is still a full-fledged JavaScript Object).

Using Magic

const usingMagic = (arr, t = {}) => arr.filter((e) => !(t[e] = e in t));

This is a fun one. Similar to using Object keys, we initialize a temporary variable t and (ab)use the fact that JavaScript Object keys are unique. We then use Array.prototype.filter() to loop through the array and kick out duplicate elements.

So how does the filter function actually work? Let's dissect it (a short trace follows the list):

!(t[e] = e in t)
  • we start evaluating the expression right to left, starting with the in operator
  • the in operator checks whether e is present in the property list of Object t and returns true or false
  • the result (true or false) gets assigned to t[e] but this isn't really important; what we want is to create the key e in Object t if it doesn't exist yet
  • what's important though is that this expression evaluates to true or false depending on whether property e is present in Object t
  • since the expression returns true if e is in t, we want to negate this because if e is already present in t then we're dealing with a duplicate
  • Array.prototype.filter() will reject elements for which the filter function returns false, leaving us with a deduplicated array
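Here's that trace for a tiny input:

// usingMagic(["scrum", "scrum"]):
// e = "scrum": "scrum" in t → false; t["scrum"] = false; !false → true  (kept)
// e = "scrum": "scrum" in t → true;  t["scrum"] = true;  !true  → false (dropped)
// result: ["scrum"]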

There are a lot of tricks and gotchas here, and we do not recommend using this in production code unless you have specific reasons (like trying to confuse your colleagues so they pay attention during your code review :P). It also came in dead last in the benchmark, taking almost double the time of the second-worst method.

Going Beyond the Primitive Case

With all that said, we think Set checks all the boxes we were looking for to get our blog tags feature done. It also looks very trendy, using modern JavaScript constructs like the spread syntax.

const dedupe = (arr) => [...new Set(arr)];

So there you have it, this is definitely the best way™ to remove duplicate elements from a JavaScript array.

Or is it?

Let's check our assumptions for a bit and try to see where our algorithms break down. Right now all we've been dealing with are arrays filled with strings. What would happen if our array happens to contain objects instead? Heck, let's throw in a couple of duplicated arrays in the mix as well.

Here's an example using Array.prototype.filter() and Array.prototype.indexOf():

const array = [
  { slug: "scrum", title: "Scrum" },
  { slug: "scrum", title: "Scrum" },
  { slug: "scrum", title: "Scrum" },
  { slug: "remote-work", title: "Remote Work" },
  { slug: "culture", title: "Culture" },
  [1, [2, 3], [[4, 5], [6, 7], 8], 9],
  [1, [2, 3], [[4, 5], [6, 7], 8], 9],
];

const usingFilter = (arr) => arr.filter((e, i, a) => a.indexOf(e) === i);

usingFilter(array);

It seems that we didn't get what we were expecting. Strangely, none of the duplicated elements were filtered out.

How about using the Set object?

const array = [
  { slug: "scrum", title: "Scrum" },
  { slug: "scrum", title: "Scrum" },
  { slug: "scrum", title: "Scrum" },
  { slug: "remote-work", title: "Remote Work" },
  { slug: "culture", title: "Culture" },
  [1, [2, 3], [[4, 5], [6, 7], 8], 9],
  [1, [2, 3], [[4, 5], [6, 7], 8], 9],
];

const usingSet = (arr) => [...new Set(arr)];

usingSet(array);

This didn't work either. We get the same result as the filter() and indexOf() version.

What gives?

Maybe the magical ways will fare better:

const array = [
  { slug: "scrum", title: "Scrum" },
  { slug: "scrum", title: "Scrum" },
  { slug: "scrum", title: "Scrum" },
  { slug: "remote-work", title: "Remote Work" },
  { slug: "culture", title: "Culture" },
  [1, [2, 3], [[4, 5], [6, 7], 8], 9],
  [1, [2, 3], [[4, 5], [6, 7], 8], 9],
];

const usingMagic = (arr, t = {}) => arr.filter((e) => !(t[e] = e in t));

usingMagic(array);

Now we get something different, but it's still definitely not what we'd consider the "correct" result. (Object property names are strings, so each element gets coerced with toString(), and all the plain objects collapse into the same "[object Object]" key.) One would have assumed deduplication was a simple enough task, so it's quite a surprise not to get the expected answer.

Speaking of assumptions, we've been talking about deduplication but neglecting to consider the elephant in the room: what does it mean for something to be a duplicate of another? In other words: what makes something different from another?

We've been taking for granted that the functions we've been using come with implicit tests of equality, and those tests may not necessarily match what we expect "equality" to mean between array elements.
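For instance, Set membership uses the SameValueZero algorithm and indexOf() uses strict equality (===); both compare objects by reference, not by structure:

const a = { slug: "scrum", title: "Scrum" };
const b = { slug: "scrum", title: "Scrum" };

"scrum" === "scrum";  // true: primitives are compared by value
a === b;              // false: structurally identical, but different references
a === a;              // true: same reference
new Set([a, b]).size; // 2: the Set keeps both objects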

We'll dive deeper into existential questions of equality and truth in part 2.
