25 September 2020

Hashed Data Types

What is hashable? A corned beef is hashable and tastes great with eggs. But this is another kind of hash.

A hash function is a function whose domain is a set of objects (such as strings) and whose codomain is the set of integers. A python object is hashable if a hash function is defined for it.


>>> hash(s)
221888878460316077
>>> s = (1,2,3,4)
>>> hash(s)
590899387183067792
>>> s = [1,2,3]
>>> hash(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  TypeError: unhashable type: 'list'

Strings and tuples are hashable (no coincidence: they are immutable.Lists are not hashable. Why? A hash function depends on the state of an object. It is a mathematical function and must be consistent. Because lists are mutable, their state is changeable. A hash function based on their state could not be consistent.

Look how integers hash. They are themselve until they get too big. (Spelunking adventure: How big??)


>>> hash(10)
10
>>> hash(1000)
1000
>>> hash(100000000)
100000000
>>> hash(1000000000000000)
1000000000000000
>>> hash(100000000000000000000000)
200376420520733032
>>> hash(True)
1
>>> hash(False)
0

Why hash other than using up leftover corned beef?

picture of yummy corned beef hash

The two resources for computing are bytes of memory and time. If you use very little memory, everything takes longer. If you use bales of memory, it consumes a lot of resources. Uh oh. Tradeoff time.

A hash table is a giant list(array). You put a hashable object in it by calling hash, modding by the size of the array, and plunking the item in a list at that location. If two hashed values collide, we store them in a list located at that array. This enables fast retrival and insertion into a hash table, provided it does not get overly crowded.

If things get crowded, the program can get more memory from the OS and a bigger array is allocated. Then, the contents of the old array are all inserted into the new, bigger house.

Here we trade using lot of memory for speed.

Sets and dictionaries are hashed data structures. We will now turn our attention to these.

Sets

A set is a container that does not allow duplicate entries.

Here is how we make an empty set.


>>> s = set()
>>> s
set()

Now let's add some items to our set. Notice that the add method has no return value.


>>> s.add("Friday")
>>> print(s.add("Saturday"))
None
>>> s
{'Saturday', 'Friday'}

Here is how to count the items in a set. No surprise.


>>> len(s)
2

Look what happens when we try to add a duplicate. Added duplicates are ignored. This can be checked very quickly, since duplicate values have the same hash code.


>>> s.add("Friday")
>>> s
{'Saturday', 'Friday'}

Stupid Questions for $400: How do we check for the presence of an item in set? Look at how we do it for a list.


>>> x = [1,2,3,4]
>>> 1 in x
True
>>> 5 in x
False

Duh.


>>> "Friday" in s
True
>>> "Easter" in s
False

Sets are heterogeneous.


>>> s.add(5)
>>> s
{5, 'Saturday', 'Friday'}

Bad! Unhashable things cannot be placed in a set. Python dishes out punishment.


>>> s.add([1,2,3])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

Can we hash floats?


>>> hash(4.2)
461168601842739204

Yep. Now let's make another set so s has some company. You can pass a list to set and it will make a set with any duplicates in the list deleted.


>>> s
{5, 'Saturday', 'Friday'}
>>> t = set([5, 4, "cows", "pigs"])
>>> t
{'pigs', 'cows', 4, 5}

Two sets are disjoint if they have no elements in common. Example: NCSSM Juniors and NCSSM seniors.


>>> t
{5, 'Saturday', 'Friday'}
>>> t.isdisjoint(s)
False

Here we can see the elements common to two sets, the intersection.


>>> t.intersection(s)
{5}

Here are all of the elments in two sets; this is the union.


>>> t.union(s)
{'cows', 4, 5, 'Saturday', 'Friday', 'pigs'}

The symmetric difference of two sets is the set of all items belonging to exactly one of them.


>>> t.symmetric_difference(s)
{'cows', 4, 'Saturday', 'Friday', 'pigs'}

To save typing, I am going to make some sets with numbers.


>>> s = set([2,3,4,5,6])
>>> t = set([2,4,6])

This evaluates to False because s contains element not in t


>>> s.issubset(t)
False

OTOH,


>>> t.issubset(s)
True

You can also do this.


>>> s <= t
False
>>> t <= s
True

Here is another way to do union, symmetric difference, and intersection.


>>> s | t
{2, 3, 4, 5, 6}
>>> s.union(t)
{2, 3, 4, 5, 6}
>>> s ^ t
{3, 5}
>>> s.symmetric_difference(t)
{3, 5}
>>> s & t
{2, 4, 6}
>>> s.intersection(t)
{2, 4, 6}
>>> s.difference(t)
{3, 5}

I beg to differ!


>>> s
{2, 3, 4, 5, 6}
>>> t
{2, 4, 6}
>>> s - t   #elements of s not in t
{3, 5}
>>> t - s   #elemeents of t not in s
set()
>>> (s - t) | (t - s)  #symmetric difference
{3, 5}

This is expected.


>>> 10 not in s
True

Here is how to comb duplictes from a list.


>>> s = [3,45,6,32,4,56,75,32,8,4,8,2,9]
>>> s = set(s)
>>> s = list(s)
>>> s
[32, 2, 3, 4, 6, 8, 9, 75, 45, 56]

Lists can be cast to tuples, too.


>>> tuple(set(s))
(32, 2, 3, 4, 6, 8, 9, 75, 45, 56)
>>>

Dictionaries

A dictionary is a collectioon of key-value pairs. Keys in dictionaries must be hashable. This enables rapid insertion and retrieval of items in a dicitonary.

Here is an empty dictionary.


>>> d = {}
>>> type(d)
<class 'dict'>

We now add an entry


>>> d["cat"] = "nine lives"
>>> d
{'cat': 'nine lives'}

You "index" into a dictionary using a key like so.


>>> print(d["cat"])
nine lives

Let's add to the menagerie.


>>> d["dog"] = "six fish"
>>> d["parrot"] = "rowdybush"
>>> d
{'cat': 'nine lives', 'dog': 'six fish', 'parrot': 'rowdybush'}

This tells you how many key-value pairs you have.


>>> len(d)
3

Observe.


>>> d["cat"] = "meow mix"
>>> d
{'cat': 'meow mix', 'dog': 'six fish', 'parrot': 'rowdybush'}

The value of "cat" is changed to "meow mix". Keys in dictionaries must be hashable.


>>> d[[1,2,3]] = "foo"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

Values don't have to be hashable.


>>> d[(1,2,3)] = "foo"
>>> d
{'cat': 'meow mix', 'dog': 'six fish', 'parrot': 'rowdybush', (1, 2, 3): 'foo'}
>>> d["rule breaker"] = [1,2,3]
>>> d
{'cat': 'meow mix', 'dog': 'six fish', 'parrot': 'rowdybush', (1, 2, 3): 'foo', 'rule breaker': [1, 2, 3]}

Dictionaries are mutable and are therefore not hashable.


>>> hash(d)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'

Casting a dictionary to a list creates a list of the dictionary's keys.


>>> list(d)
['cat', 'dog', 'parrot', (1, 2, 3), 'rule breaker']

YOu can use d.items() to walk through the key-value pairs in the dictionary.


>>> for item in d.items():
...     print(item)
... 
('cat', 'meow mix')
('dog', 'six fish')
('parrot', 'rowdybush')
((1, 2, 3), 'foo')
('rule breaker', [1, 2, 3])

This object is NOT a list, but it can be cast to one.


>>> type(d.items())
<class 'dict_items'>
>>> s = list(d.items())
>>> s
[('cat', 'meow mix'), ('dog', 'six fish'), ('parrot', 'rowdybush'), ((1, 2, 3), 'foo'), ('rule breaker', [1, 2, 3])]

Here are two ways to print out the values in a dictionary.


>>> for item in d.items():
...     print(item[1])
... 
meow mix
six fish
rowdybush
foo
[1, 2, 3]
>>> for k in d.values():
...     print(k)
... 
meow mix
six fish
rowdybush
foo
[1, 2, 3]
>>>

Dictionaries and Tallies

We are going to to a stochastic simulation of a simple system: throw a fair pair of dice and note the total number showing. What are the relative frequencies of various outcomes?

We begin by rolling dice.


import random
def rollDice():
    return (random.randint(1,6), random.randint(1,6))
for k in range(10):
    print(rollDice())

Run this and see rolls.


unix> python dice.py
(6, 2)
(3, 2)
(6, 6)
(6, 1)
(2, 4)
(4, 2)
(4, 2)
(6, 1)
(3, 1)
(6, 1)

Let's comment away the loop.


import random
def rollDice():
    return (random.randint(1,6), random.randint(1,6))
#for k in range(10):
    #print(rollDice())

We will use a dictionary to keep a tally of how many rolls of each type occur. Here is the idea. For each roll

The roll will be the key in dictionary
If the roll is not in the dictionary, we put it in an assign it a value of 1, since it has now been seen once.
If the roll is in the dictionary, we increment its value by 1


import random
def rollDice():
    return (random.randint(1,6), random.randint(1,6))
#for k in range(10):
    #print(rollDice())
tally = {}
num_trials = 10
for k in range(num_trials):
    roll = sum(rollDice())
    if roll in tally:
        tally[roll] += 1
    else:
        tally[roll] = 1

#this iterates throug the keys in order.
for k in sorted(tally.keys()):
    print k, tally[k]


unix> python dice.py
3 1
4 1
5 2
6 2
8 1
10 1
11 1
12 1

It worked for 10 rolls. Now let's try 36,000.


unix> python dice.py
2 964
3 1957
4 3001
5 4047
6 5025
7 5942
8 5025
9 4049
10 2964
11 1997
12 1029

Notice that 7 is "the most common number on the dice."