30  Hash Tables

31 Hash Tables

Week 4, Session 2. CSS 343.

31.1 Warmup

Two Sum (#1). Brute force then optimized.

// O(n²)
vector<int> twoSum_brute(vector<int>& nums, int target) {
  for (int i = 0; i < (int)nums.size(); ++i) {
    for (int j = i + 1; j < (int)nums.size(); ++j) {
      if (nums[i] + nums[j] == target) return {i, j};
    }
  }
  return {};
}

// O(n) — hash map
vector<int> twoSum(vector<int>& nums, int target) {
  unordered_map<int, int> seenAt;
  for (int i = 0; i < (int)nums.size(); ++i) {
    int complement = target - nums[i];
    if (seenAt.count(complement) == 1) return {seenAt[complement], i};
    seenAt[nums[i]] = i;
  }
  return {};
}

The hash-map version is the canonical illustration of using O(n) extra space to drop a time-complexity factor. The trade is almost always worth it.

31.2 Learning objectives

  1. Evaluate a hash function for uniformity and efficiency.
  2. Trace separate chaining and linear probing on a small table.
  3. Explain why load factor > 0.7 triggers rehashing and state amortized O(1).
  4. Use unordered_map/unordered_set with custom key types via a hash functor.
  5. Solve frequency-counting and two-sum problems optimally using hash maps.

31.3 What a hash table is

An array, plus a function that maps keys to array indices.

class StringMap {
  vector<string> data_;
  int hash(const string& key) const {
    // some integer derived from key
  }
};

If hash is good, lookup is O(1) — compute the index, look at the slot. If two keys hash to the same slot — a collision — we need a strategy.

31.4 Hash functions

A good hash function:

  1. Distributes keys roughly uniformly across the table.
  2. Is fast to compute.
  3. Uses all parts of the key.

For integers: hash(k) = k % tableSize. For strings, a polynomial rolling hash:

size_t hash(const string& s, size_t tableSize) {
  size_t h = 0;
  for (char c : s) h = h * 31 + c;
  return h % tableSize;
}

Multiply by 31 (a small prime), add the next character. Each character influences the final value. Mod by table size to get an index.

std::hash<string> does essentially this with better constants and security mitigations. You can use it directly.

31.5 Collision resolution: separate chaining

Each slot holds a linked list (or vector) of all keys that hash there.

table[0] → [key1, key17, key29]
table[1] → []
table[2] → [key5]
table[3] → [key3, key19]

Lookup: hash the key, scan the bucket’s list.

template <typename K, typename V>
class HashMap {
 public:
  HashMap(size_t buckets = 16) : buckets_(buckets), size_(0) {}

  void put(const K& key, const V& value) {
    auto& bucket = buckets_[hash(key) % buckets_.size()];
    for (auto& [k, v] : bucket) {
      if (k == key) { v = value; return; }
    }
    bucket.push_back({key, value});
    ++size_;
    if ((double)size_ / buckets_.size() > 0.75) rehash();
  }

  bool get(const K& key, V& out) const {
    const auto& bucket = buckets_[hash(key) % buckets_.size()];
    for (const auto& [k, v] : bucket) {
      if (k == key) { out = v; return true; }
    }
    return false;
  }

 private:
  vector<list<pair<K, V>>> buckets_;
  size_t size_;
  void rehash();
};

31.6 Collision resolution: open addressing

If the slot is taken, probe to the next slot. Variants:

  • Linear probing: slot, slot+1, slot+2, …
  • Quadratic probing: slot, slot+1, slot+4, slot+9, …
  • Double hashing: slot, slot + h2(key), slot + 2·h2(key), …

Linear probing is fast in practice (cache-friendly) but suffers from clustering — runs of occupied slots that grow longer. Once load factor exceeds ~0.5, performance degrades.

31.7 Load factor and rehashing

Load factor = (number of items) / (table size).

When the load factor exceeds a threshold (often 0.7), the table doubles in size and every key is re-hashed into the new table. That single operation is O(n), but it happens rarely enough that the amortized cost per insert is O(1).

template <typename K, typename V>
void HashMap<K, V>::rehash() {
  vector<list<pair<K, V>>> old = std::move(buckets_);
  buckets_.assign(old.size() * 2, {});
  size_ = 0;
  for (auto& bucket : old) {
    for (auto& [k, v] : bucket) put(k, v);
  }
}

31.8 Custom hash functors

unordered_map<pair<int,int>, int> does not compile out of the box because pair has no built-in hash. Provide one:

struct PairHash {
  size_t operator()(const pair<int,int>& p) const {
    return hash<int>()(p.first) ^ (hash<int>()(p.second) << 1);
  }
};

unordered_map<pair<int,int>, int, PairHash> grid;

A simple combine — XOR with a shift — works for short pairs. For better mixing, use boost::hash_combine or implement a similar function.

Writing your own hash for a custom type is a common interview question and a real-world need. The mechanics are simple: combine the hashes of the fields. The art is making it well-distributed.

31.9 When NOT to use a hash table

  • You need keys in sorted order → use map / set.
  • You need fast range queries (e.g. all keys between 10 and 20) → balanced BST.
  • Keys have very poor hash distribution → hash table degenerates.
  • Adversarial input → security risk (hash-collision DoS attacks).

For most application code, unordered_map is the right default.

31.10 Try it

Insert {10, 20, 30, 15, 25} into a hash table of size 7 using key % 7. Show the table state with (a) separate chaining and (b) linear probing.

  1. Separate chaining: 10 % 7 = 3, 20 % 7 = 6, 30 % 7 = 2, 15 % 7 = 1, 25 % 7 = 4. Table: [_, 15, 30, 10, 25, _, 20].

  2. Linear probing: same hashes, no collisions → same table.

Now insert 17 (17 % 7 = 3). Chaining: bucket 3 becomes [10, 17]. Linear probing: slot 3 occupied (10), try 4 (occupied, 25), try 5 (empty). Insert at 5.

31.11 Self-check

1. Why is unordered_map::find amortized O(1) despite occasional O(n) rehashes?
c. Each rehash is O(n) but doubles capacity, so n inserts trigger O(log n) rehashes totaling O(n) work. Amortized O(1) per insert.
2. unordered_map<pair<int,int>, int> fails to compile because:
b. Supply a custom hash functor as the third template argument.

31.12 Challenges

  1. Group Anagrams (#49) — unordered_map<string, vector<string>>.
  2. LRU Cache (#146) — unordered_map + doubly linked list.
  3. Longest Substring Without Repeating Characters (#3) — sliding window + hash set.
  4. Implement HashMap<K, V> with separate chaining; benchmark against unordered_map.

31.13 Where this fits

Hash tables are the fast-lookup workhorse. The next chapter goes the other direction — guaranteed O(log n) via balanced trees.

You are here Coming up
Hash tables, chaining, probing, load factor, custom hash Chapter 29: AVL trees and red-black trees