30 Hash Tables

31 Hash Tables

Week 4, Session 2. CSS 343.

31.1 Warmup

Two Sum (#1). Brute force then optimized.

// O(n²)
vector<int> twoSum_brute(vector<int>& nums, int target) {
  for (int i = 0; i < (int)nums.size(); ++i) {
    for (int j = i + 1; j < (int)nums.size(); ++j) {
      if (nums[i] + nums[j] == target) return {i, j};
    }
  }
  return {};
}

// O(n) — hash map
vector<int> twoSum(vector<int>& nums, int target) {
  unordered_map<int, int> seenAt;
  for (int i = 0; i < (int)nums.size(); ++i) {
    int complement = target - nums[i];
    if (seenAt.count(complement) == 1) return {seenAt[complement], i};
    seenAt[nums[i]] = i;
  }
  return {};
}

The hash-map version is the canonical illustration of using O(n) extra space to drop a time-complexity factor. The trade is almost always worth it.

31.2 Learning objectives

Evaluate a hash function for uniformity and efficiency.
Trace separate chaining and linear probing on a small table.
Explain why load factor > 0.7 triggers rehashing and state amortized O(1).
Use unordered_map/unordered_set with custom key types via a hash functor.
Solve frequency-counting and two-sum problems optimally using hash maps.

31.3 What a hash table is

An array, plus a function that maps keys to array indices.

class StringMap {
  vector<string> data_;
  int hash(const string& key) const {
    // some integer derived from key
  }
};

If hash is good, lookup is O(1) — compute the index, look at the slot. If two keys hash to the same slot — a collision — we need a strategy.

31.4 Hash functions

A good hash function:

Distributes keys roughly uniformly across the table.
Is fast to compute.
Uses all parts of the key.

For integers: hash(k) = k % tableSize. For strings, a polynomial rolling hash:

size_t hash(const string& s, size_t tableSize) {
  size_t h = 0;
  for (char c : s) h = h * 31 + c;
  return h % tableSize;
}

Multiply by 31 (a small prime), add the next character. Each character influences the final value. Mod by table size to get an index.

std::hash<string> does essentially this with better constants and security mitigations. You can use it directly.

31.5 Collision resolution: separate chaining

Each slot holds a linked list (or vector) of all keys that hash there.

table[0] → [key1, key17, key29]
table[1] → []
table[2] → [key5]
table[3] → [key3, key19]

Lookup: hash the key, scan the bucket’s list.

template <typename K, typename V>
class HashMap {
 public:
  HashMap(size_t buckets = 16) : buckets_(buckets), size_(0) {}

  void put(const K& key, const V& value) {
    auto& bucket = buckets_[hash(key) % buckets_.size()];
    for (auto& [k, v] : bucket) {
      if (k == key) { v = value; return; }
    }
    bucket.push_back({key, value});
    ++size_;
    if ((double)size_ / buckets_.size() > 0.75) rehash();
  }

  bool get(const K& key, V& out) const {
    const auto& bucket = buckets_[hash(key) % buckets_.size()];
    for (const auto& [k, v] : bucket) {
      if (k == key) { out = v; return true; }
    }
    return false;
  }

 private:
  vector<list<pair<K, V>>> buckets_;
  size_t size_;
  void rehash();
};

31.6 Collision resolution: open addressing

If the slot is taken, probe to the next slot. Variants:

Linear probing: slot, slot+1, slot+2, …
Quadratic probing: slot, slot+1, slot+4, slot+9, …
Double hashing: slot, slot + h2(key), slot + 2·h2(key), …

Linear probing is fast in practice (cache-friendly) but suffers from clustering — runs of occupied slots that grow longer. Once load factor exceeds ~0.5, performance degrades.

31.7 Load factor and rehashing

Load factor = (number of items) / (table size).

When the load factor exceeds a threshold (often 0.7), the table doubles in size and every key is re-hashed into the new table. That single operation is O(n), but it happens rarely enough that the amortized cost per insert is O(1).

template <typename K, typename V>
void HashMap<K, V>::rehash() {
  vector<list<pair<K, V>>> old = std::move(buckets_);
  buckets_.assign(old.size() * 2, {});
  size_ = 0;
  for (auto& bucket : old) {
    for (auto& [k, v] : bucket) put(k, v);
  }
}

31.8 Custom hash functors

unordered_map<pair<int,int>, int> does not compile out of the box because pair has no built-in hash. Provide one:

struct PairHash {
  size_t operator()(const pair<int,int>& p) const {
    return hash<int>()(p.first) ^ (hash<int>()(p.second) << 1);
  }
};

unordered_map<pair<int,int>, int, PairHash> grid;

A simple combine — XOR with a shift — works for short pairs. For better mixing, use boost::hash_combine or implement a similar function.

Writing your own hash for a custom type is a common interview question and a real-world need. The mechanics are simple: combine the hashes of the fields. The art is making it well-distributed.

31.9 When NOT to use a hash table

You need keys in sorted order → use map / set.
You need fast range queries (e.g. all keys between 10 and 20) → balanced BST.
Keys have very poor hash distribution → hash table degenerates.
Adversarial input → security risk (hash-collision DoS attacks).

For most application code, unordered_map is the right default.

31.10 Try it

Insert {10, 20, 30, 15, 25} into a hash table of size 7 using key % 7. Show the table state with (a) separate chaining and (b) linear probing.

Separate chaining: 10 % 7 = 3, 20 % 7 = 6, 30 % 7 = 2, 15 % 7 = 1, 25 % 7 = 4. Table: [_, 15, 30, 10, 25, _, 20].
Linear probing: same hashes, no collisions → same table.

Now insert 17 (17 % 7 = 3). Chaining: bucket 3 becomes [10, 17]. Linear probing: slot 3 occupied (10), try 4 (occupied, 25), try 5 (empty). Insert at 5.

31.11 Self-check

1. Why is unordered_map::find amortized O(1) despite occasional O(n) rehashes?

a. Rehashes are O(1).
b. The compiler optimizes them away.
c. Rehashes happen so rarely that their cost averaged over all operations is O(1).
d. They don't actually happen — the table is fixed size.

c. Each rehash is O(n) but doubles capacity, so n inserts trigger O(log n) rehashes totaling O(n) work. Amortized O(1) per insert.

2. unordered_map<pair<int,int>, int> fails to compile because:

a. Pairs cannot be keys.
b. There is no default std::hash specialization for pair.
c. Pairs are too large.
d. The constructor signature is ambiguous.

b. Supply a custom hash functor as the third template argument.

31.12 Challenges

Group Anagrams (#49) — unordered_map<string, vector<string>>.
LRU Cache (#146) — unordered_map + doubly linked list.
Longest Substring Without Repeating Characters (#3) — sliding window + hash set.
Implement HashMap<K, V> with separate chaining; benchmark against unordered_map.

31.13 Where this fits

Hash tables are the fast-lookup workhorse. The next chapter goes the other direction — guaranteed O(log n) via balanced trees.

You are here	Coming up
Hash tables, chaining, probing, load factor, custom hash	Chapter 29: AVL trees and red-black trees