Lecture 14 - Implementing Hashmaps

# Lecture 14 -- Implementing Hashmaps ## Motivating Question How can we build HashMaps that lookup the values of keys of arbitrary types in (effectively) constant time? ## Conceptual Study Questions - What does a `hashCode` function do? - What does "collision" mean in the context of a HashMap? - What situations can cause collisions? - If you resize the array within a HashMap, do the same collisions occur? - If our keys were integers rather than strings, could we still get collisions? ## High-Level -- How do HashMaps work? Consider the following diagram showing a set of keys mapping to a set of values: ![a key-value mapping](https://docs.cs200.io/uploads/8b1bb3e4-30de-4f4a-a8af-fda955cdf7c3.png =255x150) Accessing a value (room) in constant time given a key (lab name), requires three features: 1. *the values must be in an array*. There is no other use of heap space in which we can locate one of several related objects in constant time (unless each item was under a variable with an individual name from the environment---why would that be a problem?) 2. *the array index (integer) in which an individual key is stored must be determined (or computed) in constant time* 3. *the value for a key must always be located within the array index that has been computed for that key* ### Hash Functions -- Convert keys to integers Let's look at requirement #2: how can we convert a string to an integer? You already know several functions that do this. - you could compute the length of the string - you could count the number of spaces in the string - you could multiply the number of vowels by the number of consonants - and so on We use the term *hash function* or *hashCode* (specific to Java) for the function that converts an object to an integer. Would any of these be good hash functions for Strings? Looking at our lab example again, most of these keys have the same lengths; all have the same number of spaces. This means that most of the keys would convert to the same integer, which means they would end up in the same index in the array (there would be lots of collisions). Since the performance of a HashMap relates to distributing the keys across the array, none of these are good choices. #### ASCII Codes It turns out that converting strings to integers is something that happens all over computer science. Why? Because ultimately, the memory in your computer is only capable of storing numbers (0s and 1s to be exact). That means that all sorts of computing tools need ways to represent strings via integers (which then get converted to 0s and 1s via binary representations of numbers, but let's not get too far off topic). [ASCII Codes](https://www.ascii-code.com/codechart) are a shared international standard for representing individual characters as integers. Each character maps to a unique integer. We can use ASCII codes to hash strings to integers in various ways. One way is to sum up the ASCII codes for each character in a string. For example, here are the ASCII codes for several characters (the previous link shows the entire table): - “A” is 65 - “M” is 77 - “T” is 84 - “1” is 49 - “4” is 52 - “8” is 56 For sake of illustration, let's reduce our lab times to two character strings, such as "M8" instead of "Mon 8-10". We could then convert "M8" to an integer with the expression $$ascii("M") + ascii("8")$$ Following the conversions above, this is 77 + 56, which equals 133. Similarly, "T4" would convert to 84 + 52, which equals 136. If we do this for all of our lab times, we get: - the hashCode of "M4" is 129 - the hashCode of "M8" is 133 - the hashCode of "T1" is 133 - the hashCode of "T4" is 136 - the hashCode of "T7" is 139 ### Convert From Integers to Array Indices We've figured out how to convert our String keys for the labs example into integers. But in order to store values for keys in array cells, we need to assign each of these hashCode values to array indices. Let's imagine that our HashMap was using an array of length 10. We'd thus need to convert each of the above hashCodes to an index from 0 through 9. How might we do that? Again, our friend modulo to the rescue! Just as we used modulo to convert list positions into array indices in cyclic arrays, we can use it here to convert hashCodes to array indices as follows: - the array index for "M4" is `129 % 10`, which equals 9 - the array index for "M8" is `133 % 10`, which equals 3 - the array index for "T1" is `133 % 10`, which equals 3 - the array index for "T4" is `136 % 10`, which equals 6 - the array index for "T7" is `139 % 10`, which equals 9 ### Putting it all Together Based on our formula that converts strings to indices, here's the underlying array that results: | Array | | -------- | | null | | null | | null | | the value for M8 and the value for T1 | | null | | null | | the value for T4 | | null | | null | | the value for "M4" and the value for "T7" | But wait -- didn't we say that a HashMap could have at most one value per key? How can we then have multiple values in the same array cell? When two keys are placed in the same array cell, we have what is called a "collision" in HashMap terminology. Here, we had collisions arises from two sources: - two strings converted to the same integer (e.g., both "M8" and "T1" converted to hashCode 133) - two hashCodes mapped to the same array index after modulo (e.g., both 129 and 139 map to index 9 in an array of length 10) Collisions are unavoidable in hashmaps. Sometimes we get lucky and don't have them, but mathematically we always have to assume that they can occur. This means we have to support multiple values in a single array cell. **Exercise** What if the array had been of length 7 instead? To which index would each key have been assigned in that case? #### The Array Must Hold Lists of Values To put multiple values in a single array cell, that cell must contain a data structure of values, rather than a single value. Typically, that data structure is a (linked) list: ![a key-value mapping](https://docs.cs200.io/uploads/8b1bb3e4-30de-4f4a-a8af-fda955cdf7c3.png =255x150) | Array | | -------- | | null | | null | | null | | List("CIT501", "CIT501") *-- the values for M8 and T1* | | null | | null | | List("CIT267") *-- the value for T4* | | null | | null | | List("CIT444", "CIT501") *-- the values for "M4" and "T7"* | #### How the `get` method works Assume we want to get the value for key "T4" from the HashMap. We would convert "T4" to array index 6 and return the value found (here "CIT267"). But what if we asked to get the value for key "T7". We compute that its value is in array index 9, but now there are two values in the cell's list. Which one goes with "T7"? With this organization of data, we have no way of knowing. #### Actually, the Array must hold lists of pairs of keys and values To disambiguate which values go with which keys, what we actually store in each array cell is a list of objects called "key-value pairs". ```=java class KVPair<K, V> { public K key; public V value; } ``` | Array | | -------- | | `List()` | | `List()` | | `List()` | | `List(KVPair("M8",CIT501"), KVPair("T1","CIT501")` | | `List()` | | `List()` | | `List(KVPair("T4",CIT267"))` | | `List()` | | `List()` | | `List(KVPair("M4","CIT444"), KVPair("T7","CIT501")` Now, when we are asked to `get` the value for a key, we go to the array index corresponding to the key and search the list there to find the `KVPair` that contains that key. We return the value from that same `KVPair`. ### The Runtime of HashMap `get` Hold on -- once we have to search a list to find the value for a key, aren't we using linear time? Wasn't the whole point of hashmaps to retrieve values in constant time? This is a good time to remind ourselves that anytime we talk about runtime, we have to include the variable over which we mean "linear" or "constant". For HashMaps, we want lookup to be constant in the number of key/value pairs. That is a different variable than the lists inside the array cells. To keep hashmaps (effectively) constant time, we limit the length of each list in each array cell to a constant length. In that case, our `get` method would be "linear in the max size of the arrays", but still "constant in the number of key-value pairs". Actually keeping the internal lists to a constant takes some careful engineering, along with resizing the arrays when the internal lists get too long. This is more detail that we'll go into here. In particular, **when you implement HashMaps for homework, you do not need to guarantee constant length lists**. ### Implementing Hashmaps: Overview of the Code Let's put all of this conceptual overview into code (which you will finish for homework). First, here is the interface for HashMap operations (here, we use the name `IDictionary` to avoid confusion with Java HashMaps, while also noting that these are called dictionaries in other languages) ```=java= public interface IDictionary<K, V> { // corresponds to Java HashMap get public V lookup(K key) throws KeyNotFoundException; // change the value associated with a key already in the hashmap public V update(K key, V value) throws KeyNotFoundException; // insert a value for a new key public void insert(K key, V value) throws KeyAlreadyExistsException; // remove a key (and its value) from the dictionary public V delete(K key) throws KeyNotFoundException; } ``` In Java HashMaps, the `put` method covers both of the `update` and `insert` methods in `IDictionary`. We are separating them here to give you finer-grained control of error checking. We're calling the HashMap class itself "Chaining", since the lists form chains of KVPairs. We've given you most of the `lookup` and `update` methods, with the exception of the `findKVPair` method. That is the method that computes the hashCode of the key and converts it to an array index using modulo. *For help with the `throws ... Exception` annotations, see the notes from the lecture on MVC and Exceptions.* ```=java= public class Chaining<K, V> implements IDictionary<K, V> { private static class KVPair<K, V> { public K key; public V value; } public Chaining(int size) { . . . } private KVPair<K, V> findKVPair(K key) throws KeyNotFoundException { . . . } public V lookup(K key) throws KeyNotFoundException { KVPair<K, V> pair = findKVPair(key); return pair.value; } public V update(K key, V value) throws KeyNotFoundException { KVPair<K, V> pair = findKVPair(key); V oldValue = pair.value; pair.value = value; return oldValue; } public void insert(K key, V value) throws KeyAlreadyExistsException { . . . } public V delete(K key) throws KeyNotFoundException { . . . } public boolean equals(Object ht) { . . . } public String toString() { . . . } } ``` ### Where does the code to compute a key's hashCode go? Every class in Java has a default `hashCode` method that produces an integer for objects of that class. For built-in classes, like String, the `hashCode` method can be accessed as follows: ```=java= "Learning Hashmaps".hashCode() String s = "hello bruno" s.hashCode() ``` If you want to use a class that you defined as a key, you should override the `hashCode` method (the default uses the address of the object in memory). This might look as follows: ```=java= class Boa { String name; int length; String eats @Override public integer hashCode() { return this.name.hashCode()*3 + this.length*5 + this.eats.hashCode()*7 } } ``` It is common to convert each field to its hashCode, multiply each hashCode to a unique prime number, and add them together. When computing hashCodes, only include fields that are included in the computation of the class' `equals` method. Otherwise, the underlying implementation can break.