Followers you both follow
Open Instagram, tap on Virat Kohli's profile, and look for the line that says "followers you both follow". One side of the comparison is you, with maybe 800 follows. The other side is Kohli, with around 27 crore followers. The result appears in the time it takes the page to render.
Nothing scanned 27 crore accounts. The trick is that the system iterated your 800, not Kohli's 27 crore — asking the larger set, "do you contain this account?" once per follow you have. That asymmetric move is what hash sets are for.
A hash set is a hash map with no values
The bucket array, the hashing, the chains, the collisions — all of it is identical to the hash-maps lesson. The only difference is the cell contents. A hash map's bucket holds key-value pairs. A hash set's bucket holds bare keys. There is no value column to look up; there is only "is this key here, yes or no?"
In the JDK, this isn't even a metaphor — it's the literal implementation:
public class HashSet<E> {
private transient HashMap<E, Object> map;
private static final Object PRESENT = new Object();
// every add(e) is just map.put(e, PRESENT)
}
HashSet is a HashMap with a single shared dummy value. Go is even more honest about it: there is no built-in set type, and idiomatic Go code uses `map[K]struct{}` — `struct{}` being the zero-size type that says "I only care about presence."
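The same trick transplants directly into JavaScript. Here's a minimal sketch — `MapBackedSet` and `PRESENT` are names invented for this illustration, not anything from the JDK or the JS standard library:

```javascript
// Sketch: the JDK's trick in JavaScript — a "set" that is
// literally a Map with one shared dummy value.
const PRESENT = Symbol("present");

class MapBackedSet {
  constructor() { this.map = new Map(); }
  add(e) { this.map.set(e, PRESENT); return this; } // every add is just a put
  has(e) { return this.map.has(e); }
  delete(e) { return this.map.delete(e); }
  get size() { return this.map.size; }
}

const s = new MapBackedSet();
s.add("alice").add("bob").add("alice");
console.log(s.size);       // → 2: the duplicate add was a no-op
console.log(s.has("bob")); // → true
```

The value column never varies and is never read — which is exactly why Go's `struct{}` version can shrink it to zero bytes.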
So why teach this as a separate topic at all? Because the API matters. Hash sets exist to express three things cleanly: membership, deduplication, and set algebra. The third is where the lesson earns its keep.
8 buckets. Bare keys, no values.
function makeSet(capacity) {
return {
buckets: Array.from({ length: capacity }, () => []),
size: 0,
capacity,
};
}
The visualiser starts with eight empty buckets. Step through add("alice") and add("bob") and watch the keys land — the hash strip resolves, the chain at the destination bucket grows by one cell. No value pill. Just a key.
Now try add("alice") a second time. Watch what the visualiser does — and what it doesn't.
Bucket 2 already holds ["alice"]. We call add("alice") again. Which picture matches the bucket's chain immediately after this call returns?
The set is duplicate-free by definition. Calling add with a key that's already present is silently a no-op. That's not a quirk; it's the entire point. Wherever you'd write if (!seen.includes(x)) seen.push(x) with a list, you can write seen.add(x) with a set and stop tracking what you've already inserted.
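That no-op behaviour falls straight out of the implementation. Here's a sketch of `add` over the chained structure `makeSet` builds — `hashString` is a hypothetical helper (any string hash would do; this one is the classic multiply-by-31):

```javascript
// The lesson's bucket structure: an array of chains, bare keys only.
function makeSet(capacity) {
  return { buckets: Array.from({ length: capacity }, () => []), size: 0, capacity };
}

// Hypothetical hash helper: multiply-by-31 over char codes, forced unsigned.
function hashString(key) {
  let h = 0;
  for (const ch of key) h = (h * 31 + ch.charCodeAt(0)) | 0;
  return h >>> 0;
}

function add(set, key) {
  const bucket = set.buckets[hashString(key) % set.capacity];
  if (bucket.includes(key)) return false; // already present: silent no-op
  bucket.push(key);                       // bare key — no value cell
  set.size += 1;
  return true;
}

const s = makeSet(8);
add(s, "alice");
add(s, "bob");
add(s, "alice");        // duplicate: nothing changes
console.log(s.size);    // → 2
```

The duplicate check is the `bucket.includes(key)` line — the same chain walk a lookup does, run before every insert.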
The order of has is exactly the order of get
has(k) does two things. First it hashes k to a bucket. Then it walks that bucket's chain comparing keys. Same machinery as the get from the hash-maps lesson, with the value-extraction step deleted.
Drag these four steps into the order has("zara") executes on a chained hash set.
- 1. Compare each chain entry's key to "zara"
- 2. Compute hash("zara")
- 3. Read the bucket array at that index
- 4. Take hash mod capacity to get a bucket index
Notice what's missing from that order: scanning the bucket array. The hash takes you to one bucket and you only look at that bucket. This is the entire reason has is fast — and the entire reason it falls apart under collisions. Java's HashSet, sitting on top of HashMap, inherits the same red-black-tree fallback past 8 collisions in a bucket. Worst-case has becomes O(log n) instead of O(n) — a defence against the same algorithmic complexity attacks that hash maps face.
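The four steps map one-to-one onto lines of code. A sketch, using the same hypothetical `hashString` helper as before (not part of the lesson's API):

```javascript
// Hypothetical hash helper: multiply-by-31 over char codes, forced unsigned.
function hashString(key) {
  let h = 0;
  for (const ch of key) h = (h * 31 + ch.charCodeAt(0)) | 0;
  return h >>> 0;
}

// has() on a chained hash set, annotated with the four steps in order.
function has(set, key) {
  const hash = hashString(key);       // 1. compute hash(key)
  const index = hash % set.capacity;  // 2. take hash mod capacity
  const bucket = set.buckets[index];  // 3. read the bucket array at that index
  return bucket.includes(key);        // 4. compare each chain entry's key
}

// One bucket holds "zara"; no other bucket is ever touched.
const set = { buckets: Array.from({ length: 8 }, () => []), size: 1, capacity: 8 };
set.buckets[hashString("zara") % 8].push("zara");
console.log(has(set, "zara")); // → true
console.log(has(set, "zeke")); // → false
```

Every bucket except the one the hash names is dead weight to this function — that's the O(1) promise, and the collision attack surface, in four lines.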
When Set is slower than Array
There's a load-bearing word in "hash sets are O(1)": amortised. The hash function isn't free. The bucket access has cache costs. For very small collections, the constant factor of all that machinery exceeds the cost of a flat linear scan.
You have a list of 5 user roles and need to check if a given role is in the list. Which is fastest in practice?
Engines like V8 optimise small `Array.includes` scans heavily. For five user roles, a flat array typically beats a `Set` — a linear scan over five contiguous elements is cheaper than hashing. The crossover point is somewhere around 30–100 elements depending on key size and engine; past it, `Set` wins decisively. But reflexively reaching for `Set` for a 5-element collection pays a constant-factor tax for nothing. Big-O is a promise about scaling, not about constant-factor speed.
The operation that justifies the whole API
Once you know add and has, the interesting surface that's not in a hash map opens up: union, intersection, difference. Union is the obvious one — every key from either set, no duplicates. The cost is O(|a| + |b|) because you have to look at every element on both sides at least once.
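Union over native JS Sets is a two-liner — a sketch:

```javascript
// Union: every key from either side, duplicates collapse automatically.
// Cost is O(|a| + |b|): every element of both sets is touched once.
function union(a, b) {
  const out = new Set(a);        // copy one side wholesale
  for (const k of b) out.add(k); // duplicate adds are no-ops
  return out;
}

const u = union(new Set([1, 2, 3]), new Set([3, 4]));
console.log([...u]); // → [1, 2, 3, 4]
```

There's no smaller-side trick available here: union must emit every element of both inputs, so both must be walked.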
Intersection looks symmetric on paper but is not symmetric in cost. Switch the visualiser to pair mode. Set A on the left has 3 keys. Set B on the right has 12.
function intersection(a, b) {
const [small, large] = a.size <= b.size ? [a, b] : [b, a];
const out = new Set();
for (const k of small) {
if (large.has(k)) out.add(k);
}
return out;
}
The first line is the entire trick. Pick the smaller side as the iteration source, and ask the larger side a yes/no question for each of those elements. The cost is min(|a|, |b|) lookups, not |a| + |b|.
You're computing the intersection of two sets. Set A has 200 members. Set B has 27 crore (270 million). Which set should your loop iterate?
Run the visualiser's "Replay (wrong direction)" once after the right-direction run. The result is the same — every key that's in both sets — but the probe count goes from 3 to 12. Now scale that up. Your 800 follows against Kohli's 27 crore: 800 probes, finishing in milliseconds. Doing it the other way would be 27 crore probes against your set. Same answer, over five orders of magnitude more work.
This is exactly the move Redis makes inside SINTER. The Redis source for set intersection (t_set.c) explicitly sorts the input sets by size and iterates the smallest one, probing the rest. It's the difference between a "common followers" feature being a one-millisecond Redis call and being unshippable.
The bug that makes intersection mysteriously slow
The version of the function that doesn't pick the smaller side still works. It returns the right answer. It just does a lot more work than it needs to. That makes the bug invisible in tests and obvious only in production, on lopsided inputs.
This intersection function works on small inputs but is mysteriously slow when one set is much bigger than the other. Click the buggy step.
Most code review catches "off-by-one" bugs and "wrong return value" bugs. Asymmetric-cost bugs slip through because the output is correct — the function is just slow when the inputs happen to be lopsided. If your code does set intersection and has any chance of running on a 100-vs-100M pair, write the smaller-set check explicitly. Don't trust callers to pass the small side first.
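For contrast, here's a sketch of the buggy shape — `intersectionBuggy` is a hypothetical name for the version that skips the size check:

```javascript
// The buggy shape: correct output, asymmetric cost.
// Iterating `a` unconditionally means a 270M-element `a` gets fully
// walked even when `b` has only 200 members.
function intersectionBuggy(a, b) {
  const out = new Set();
  for (const k of a) {  // BUG: always iterates a, regardless of size
    if (b.has(k)) out.add(k);
  }
  return out;
}

// Same answer in either argument order — the probe count is what differs.
const small = new Set(["x", "y"]);
const large = new Set(["y", "z", "w", "x", "q"]);
console.log([...intersectionBuggy(large, small)].sort()); // → ["x", "y"]
console.log([...intersectionBuggy(small, large)].sort()); // → ["x", "y"]
```

No test you'd naturally write distinguishes the two versions; only a lopsided input does.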
What counts as "the same key"
The set is duplicate-free. But "duplicate" is defined by whatever the language's equality and hashing rules say it is — and those rules don't always match what your eyes see.
In JavaScript, what does this print? const s = new Set(); s.add({}); s.add({}); console.log(s.size);
JavaScript's Set uses SameValueZero equality, which for objects means reference identity. Two different object literals are two different members, even when their contents are identical. If you want value-based deduplication, you have to canonicalise first — usually by stringifying with JSON.stringify or by interning into a known object instance.
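A minimal sketch of the stringify approach — with the caveat that `JSON.stringify` is key-order sensitive, so this assumes objects are built with a consistent key order:

```javascript
// Value-based dedup by canonicalising each object to a string first.
const seen = new Set();
const dedup = [];
for (const obj of [{ a: 1 }, { a: 1 }, { a: 2 }]) {
  const key = JSON.stringify(obj); // canonical form: the serialised value
  if (!seen.has(key)) {
    seen.add(key);
    dedup.push(obj);
  }
}
console.log(dedup.length); // → 2: the two { a: 1 } literals collapse to one
```

Without the stringify step, all three literals would survive — each `{}` is its own reference.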
The same shape of trap shows up in Java with the equals/hashCode contract: override one without the other, and HashSet quietly loses members. Python takes the third path and refuses mutable built-ins as set members entirely — lists and dicts aren't hashable, so only hashable (effectively immutable) types can go in a set.
The bug that makes entries vanish
The hash maps lesson covered one version of this — mutating a key after putting it in the map silently corrupts the map. Sets carry the same rule: mutating an element after adding it to the set silently loses it.
A developer adds an array [1, 2] to a set (in a language that allows it). They then push 3 onto the same array, making it [1, 2, 3]. Then they call set.has(theArray). What happens?
The set hasn't lost anything in memory. The element is still sitting in the bucket where add put it. The lookup just hashes the (now-mutated) value, lands in a different bucket, and reports "not present". Same root cause as the hash-maps version of this bug, same defence: don't mutate elements after putting them in a set.
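JavaScript's native `Set` dodges this particular bug, because it hashes objects by identity rather than by value. To watch the bug happen, here's a sketch of a *value-hashing* set over the lesson's chained buckets — `hashValue` is a hypothetical helper that hashes an element's JSON serialisation:

```javascript
// Hypothetical value hash: serialise, then multiply-by-31 over chars.
function hashValue(x) {
  const s = JSON.stringify(x);
  let h = 0;
  for (const ch of s) h = (h * 31 + ch.charCodeAt(0)) | 0;
  return h >>> 0;
}

const capacity = 8;
const buckets = Array.from({ length: capacity }, () => []);
const add = (x) => buckets[hashValue(x) % capacity].push(x);
const has = (x) =>
  buckets[hashValue(x) % capacity]
    .some((e) => JSON.stringify(e) === JSON.stringify(x));

const arr = [1, 2];
add(arr);
console.log(has(arr)); // → true: probes the bucket it was stored in

arr.push(3);           // mutate after insertion — the hash now differs
console.log(has(arr)); // → false: the lookup probes a different bucket
console.log(buckets.flat().includes(arr)); // → true: still in memory, just unreachable
```

The element is right where `add` put it; `has` simply never looks there again.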
When the hash set is the wrong answer
Hash sets are the right tool for membership, dedup, and most set algebra. They become the wrong tool the moment you need any of these:
- Ordered traversal. `HashSet` has no inherent order. Java's `TreeSet` (sorted) and `LinkedHashSet` (insertion-ordered) exist for exactly this gap. Python's `set` has no order at all; if you need order, use a `dict` (which preserves insertion order) and ignore the values. JavaScript's `Set` does iterate in insertion order — that's an ES2015 spec guarantee — but most other languages don't.
- Range queries. "Give me every key between A and Z" is a tree-based question, not a hash question. Hashing destroys order; the bucket array has no notion of "between".
- Approximate membership at extreme scale. A Bloom filter trades certainty for ~10× space savings. If you're tracking which URLs you've already crawled across a billion-page web crawl, the Bloom filter is what fits in RAM; the hash set isn't.
- Frequency counting. A set tells you presence, not count. If you need to know "how many times has each value appeared", you want a hash map (or a multiset / `Counter`), not a set.
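The last point in one sketch — counting needs the value column a set threw away:

```javascript
// Frequency counting: a Map is the right shape, because a Set has
// nowhere to store the count. The value column the set deleted is
// exactly what holds it.
const counts = new Map();
for (const word of ["a", "b", "a", "c", "a"]) {
  counts.set(word, (counts.get(word) ?? 0) + 1);
}
console.log(counts.get("a")); // → 3
console.log(counts.get("b")); // → 1
```

A `Set` built from the same input would tell you only that "a" appeared — the 3 is unrepresentable.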
Where this shows up in real systems
- Redis sets: `SADD`, `SISMEMBER`, `SUNION`, `SINTER`, `SDIFF`. Small integer-only sets are stored as a packed `intset`; past `set-max-intset-entries` (default 512) it switches to a hashtable. `SINTERSTORE` is the canonical "common followers" primitive — Twitter, Instagram, and most follower-graph services have implemented mutual-follow features on top of it.
- CPython `set`: open addressing, same probing scheme as `dict`, value slot removed. About 25% smaller per entry than the equivalent `dict`. `frozenset` is the immutable variant — useful as a `dict` key, since regular sets aren't hashable.
- Java `HashSet`: a thin shell over `HashMap<E, Object>` with one shared dummy value. Same red-black-tree fallback past 8 collisions in a bucket.
- Go: no built-in set type. Idiomatic Go uses `map[K]struct{}` because `struct{}` occupies zero bytes — the language being honest about the relationship between sets and maps.
- Rust `HashSet<T>`: backed by `hashbrown::HashSet`, which is a wrapper over `HashMap<T, ()>`. Same Swiss-table machinery, SIMD probes, all of it.
- PostgreSQL query planner: `IN (...)` clauses past about 10 values get converted to a hash semi-join, which builds an internal hash set out of the IN list. This is why `WHERE id IN (1,2,3)` and `WHERE id IN (10,000 ids)` look completely different in `EXPLAIN`.
- Spark / Presto / BigQuery: `DISTINCT`, `UNION`, `INTERSECT`, and `EXCEPT` are all hash-set operations under the hood (or merge-based when input is pre-sorted). `UNION ALL` skips dedup entirely and is dramatically cheaper than `UNION` — a fact every data engineer eventually learns the expensive way.
The "followers you both follow" case from the top of this lesson is the cleanest possible illustration. Take the smaller follower set. For each entry, ask the larger set whether it contains that account. Build up the result. The total cost is the size of the smaller side, not the sum.
What to learn next
- Linked lists (singly) — the structure inside each bucket's chain, finally taught on its own terms.
- Sliding window (variable) — the first pattern that uses a hash set for live deduplication while a window moves.
- Bloom filters — what you reach for when even a hash set's space cost is too much.