Chapter 19 - Notes

Sets

The std::set and std::multiset are containers that facilitate fast lookup of keys.

  • Lookup: the process of looking something up

  • Keys: keys are simply unique values stored in a container

Set vs Multiset

A set only contains unique values, while a multiset can contain duplicate values.

Visual representation of a set vs multiset (only demonstrative)

Pros and Cons

The set and multiset containers are useful in applications that require frequent lookups.

However to facilitate this fast lookup, there is an overhead incurred when inserting elements because they have to maintain their sorted state.

Looking up elements in a sorted container is always quicker than iterating through an unsorted container.

Binary Trees

In computer science, a binary tree is a tree data structure where each node has at most two children, referred to the left child and the right child respectively.

A binary search tree (BST) is a special type of binary tree that follows a set of ordering rules. The ordering rules for a BST are as follows:

  • The left child node will always bear a value smaller than its parent node.

  • The right child node will always bear a value larger than its parent node.

If we follow these rules while populating our binary tree, we can infer the following:

  • All nodes to the right tree of a parent node will have values greater than the parent node.

  • All nodes to the left tree of a parent node will have values smaller than the parent node.

BST Insertion Example

To insert a value within a BST, we first start at the root node, then:

  • If the value of the current node is greater than the value we're inserting, check the left subtree.

  • If the value of the current node is smaller than the value we're inserting, check the right subtree.

  • In any case, if the subtree does not exist, insert our node there.

  • If the value of the current node equals the value we're looking for, exit the function.

BST Finding

To find a value within a BST, we first start at the root node, then:

  • If the value of the current node is greater than the value we're looking for, check the left subtree.

  • If the value of the current node is smaller than the value we're looking for, check the right subtree.

  • In any case, if the subtree we should look for does not exist, return false.

  • If the value of the current node equals the value we're looking for, return true.

Conclusion

The numbers of elements that we have to iterate over when we search or insert into a binary tree is proportional to the depth (or height) of the binary tree.

This is obviously more efficient that comparing against every element of the container.

In the future you will learn that this turns out to be an O(log N) operation.

Hash Sets

Searching in a sorted container represented by a binary tree is faster than searching in an unsorted container such as std::vector.Yet, this improvement is not enough at times.

A hash-based implementation, where a hash function is used to determine the index of the element in a sparse array, is able to perform lookups and insertions at constant time.

  • Hash function: a function that can be used to convert data of any type into fixed-size integer values.

  • Sparse array: a sparse array is an array in which most of its elements are zero, or have no value.

Hash sets are used by the std::unordered_set container.

Visualisation

https://www.cs.wcupa.edu/rkline/ds/hash-sets.html

Example

#include <iostream>
#include <vector>

// Simple hash function that sums the ASCII value for each character in the string 
int hash(const std::string& value) {
    int result = 0;
    for (int i = 0; i < value.size(); i++) {
        result += (int)value[i];
    }
    return result;
}

int main() {
    // A hash function
    std::cout << hash("Alice") << std::endl;
    std::cout << hash("Bob") << std::endl;
    std::cout << hash("Charlie") << std::endl;

    // Create a sparse array of 1000 strings
    std::vector<std::string> hashTable;
    hashTable.resize(1000);

    // To add an element to a hashTable, hash it, then use the result as the index
    int index = hash("Alice");
    hashTable[index] = "Alice";
    std::cout << "Index of Alice is: " << index << std::endl;

    index = hash("Bob");
    hashTable[index] = "Bob";
    std::cout << "Index of Bob is: " << index << std::endl;

    index = hash("Charlie");
    hashTable[index] = "Charlie";
    std::cout << "Index of Charlie is: " << index << std::endl;

    // To find an element in a hashTable, hash it, then use the result as the index 
    index = hash("Alice");
    if (hashTable[index] != "") {
        std::cout << "Alice present in the hash set" << std::endl;
    } else {
        std::cout << "Alice NOT present in the hash set" << std::endl;
    }

    index = hash("Daniel");
    if (hashTable[index] != "") {
        std::cout << "Daniel present in the hash set" << std::endl;
    } else {
        std::cout << "Daniel NOT present in the hash set" << std::endl;
    }

    // If our sparse array is smaller than the range of the hash function, then we can use modulo to limit its range 
    std::string very_long_string = "This is a very long string that would not have fit!";

    index = hash(very_long_string);
    std::cout << "Index of very long string: " << index << std::endl;

    index = hash(very_long_string) % hashTable.size();
    std::cout << "Index of very long string (with modulo): " << index << std::endl;

    hashTable[index] = very_long_string;

    // Likewise, when we check for a presence of an element, we will have to modulo the result of the hash function as well
    index = hash(very_long_string) % hashTable.size();
    if (hashTable[index] != "") {
        std::cout << "Very long string present in the hash set" << std::endl;
    } else {
        std::cout << "Very long string NOT present in the hash set" << std::endl;
    }
}

Sets

For hash sets, we used a hash function to convert a value of a certain type into an integer, which we can apply the modulo operator to to obtain a number restricted to a certain range.

This number was then used as an index into a boolean array, where a value true at that index indicates that the element exists in our set, and a value of false indicates the opposite.

While this attempts to illustrate the main idea of hash sets, there's one fatal flaw in that it does not address hash collisions - where two distinct keys hash to the same index in our hash set.

A common solution to this problem is to instead store a vector of keys at each index of the hash set, commonly referred to as a bucket. The bucket would store all keys that hash into that particular value.

Search then becomes a little more complicated in that you have to:

  • First, look up the bucket corresponding to the value by hashing the value and using the result as an index to the hash set

  • Then, search the bucket for a key matching the value we're searching for

Conclusion

If the hash function can be performed in constant time, then indexing into the hash table, used for both insertion and lookups, can also be performed at constant time.

This is at the expense of a higher memory footprint for the sparse array, and can be quite difficult to implement in practice. For example what happens when two values have the same hash value? This is known as hash collisions and in the future you will learn the different solutions that address this.

For now it is sufficient to know in principle how hash sets work:

  • What is a hash function?

  • What is the purpose of a hash function in the context of a hash set?

  • What is a sparse array?

  • What is the consequence of having a sparse array that is too small?

Last updated

Was this helpful?