Lecture: 23 Feb, 2009

Khaled Harras (With a nod to Mark Stehlik)

Hashing

Lab 4 involves hashing. Hashing in computer science is about reducing the cardinality of the set of keys that you have to initially search through to a more manageable size. In doing so, you typically trade uniqueness of the result of your query for a faster search (unless your hash function is perfect, which we will discuss).

Perhaps you've encountered real-world hashing situations. A multiple-drawer filing cabinet vs. a linear collection of file folders is an example of hashing. In my office, I have a drawer for seniors, juniors, and sophomores. To find a student, I first need to "compute" their year, then find them in the drawer. Of course, this example is a little stilted because if this were a real hash table, I wouldn't care what order the students were in inside the drawer. But my student folders are alphabetical (because my real-life implementation of linear search is much slower than a computer's!). Even though the example is imperfect, it serves to illuminate some key concepts of hashing:

hashing involves mapping a key to an integer (in this case, a drawer) via a hash function
unless the hash function is perfect (if ∀ keys, h(key) is distinct, i.e., h(key1) ≡ h(key2) ⇒ key1 ≡ key2), there will be collisions
those collisions need to be resolved somehow (either with chaining or open addressing).
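To make the definition of "perfect" concrete, here is a minimal sketch in C. Both first_char_hash and is_perfect_on are hypothetical names invented for this illustration; they are not part of the lab. The check simply looks for two distinct keys that hash to the same value.

```c
#include <stddef.h>

/* Hypothetical toy hash: uses only the first character of the key. */
unsigned first_char_hash(const char *key)
{
    return (unsigned char)key[0];
}

/* Returns 1 if h maps every key in keys[0..n-1] to a distinct value
   (h is perfect on this key set), 0 if any two keys collide.
   Assumes the keys themselves are all distinct. */
int is_perfect_on(const char *keys[], size_t n, unsigned (*h)(const char *))
{
    size_t i, j;
    for (i = 0; i < n; i++)
        for (j = i + 1; j < n; j++)
            if (h(keys[i]) == h(keys[j]))
                return 0;   /* collision: two distinct keys, same hash */
    return 1;
}
```

With the keys "apple", "banana" and "cherry" the toy hash happens to be perfect; add "avocado" and it collides with "apple", so the collision would have to be resolved.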

The most important thing in hashing is to pick a decent hash function (there are lots of bad ones, like hash(key) = 3;). A decent hash function should be easy/fast to compute and should distribute the keys across the table as uniformly as possible. But it's not necessary to try to come up with a perfect hash function all the time (and, in most cases, it's bloody unlikely you'll be able to do so). Since most hash functions are likely to be imperfect (i.e., generate collisions), the two most important things to determine (after coming up with a decent hash function) are the size of the table and the collision resolution scheme.

For the current assignment, which has a maximum file size of 172,000 words, we'll use a table whose size is fixed at 1000. Also, we'll use a fairly standard hashing function for strings: the sum, over every character in the string, of the character's ASCII value (obtained by casting the character to an int) multiplied by its position in the string. Since our table size is small with respect to the size of our largest input file, we will resolve collisions via chaining, so each element of the array will be the head of a linked list (i.e., a pointer to a ListNode).
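The following is a rough sketch of such a table in C, not the reference solution. Several details are assumptions: the fields of ListNode, counting positions from 1, reducing the sum with % to get an array index, and the function names insert and contains.

```c
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 1000

/* Assumed node layout; the lab's actual ListNode may differ. */
typedef struct ListNode {
    char *word;
    struct ListNode *next;
} ListNode;

/* Sum of (ASCII value * position), reduced modulo the table size.
   Assumptions: positions are counted from 1, and the sum is reduced
   with % so that it is a valid array index. */
unsigned hash(const char *word)
{
    unsigned long sum = 0;
    size_t i;
    for (i = 0; word[i] != '\0'; i++)
        sum += (unsigned long)(unsigned char)word[i] * (i + 1);
    return (unsigned)(sum % TABLE_SIZE);
}

/* Returns 1 if word is in the table, 0 otherwise. */
int contains(ListNode *table[], const char *word)
{
    ListNode *p;
    for (p = table[hash(word)]; p != NULL; p = p->next)
        if (strcmp(p->word, word) == 0)
            return 1;
    return 0;
}

/* Inserts a copy of word at the head of its chain.
   Returns 1 on success, 0 if memory allocation fails. */
int insert(ListNode *table[], const char *word)
{
    unsigned h = hash(word);
    ListNode *node = malloc(sizeof *node);
    if (node == NULL)
        return 0;
    node->word = malloc(strlen(word) + 1);
    if (node->word == NULL) {
        free(node);
        return 0;
    }
    strcpy(node->word, word);
    node->next = table[h];   /* a collision simply lengthens the chain */
    table[h] = node;
    return 1;
}
```

The table itself can be declared as static ListNode *table[TABLE_SIZE]; so that every cell starts out as NULL (an empty chain). Multiplying by position means that anagrams such as "ab" and "ba" usually land in different cells (293 and 292 under these assumptions), which a plain sum of ASCII values would not achieve.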

I wrote a quick "sanity-check" piece of code that determined that the table/hash function combination results, for the 172,000-word file, in all 1000 cells of the table being used, with the largest list of collisions numbering 365 and the smallest 19. Not bad when compared to all the work of doubling a many-thousand cell array as in Lab 3 or trying to find an item in a 172,000-node linked list!
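A sanity check of this kind might look something like the sketch below. It is not the code that produced the numbers above. It assumes the same hash as described earlier (positions counted from 1, sum reduced modulo the table size), and it counts every word it is given, so repeated words add to a cell's count. The name chain_stats is made up for this example.

```c
#include <stddef.h>

#define TABLE_SIZE 1000

/* Assumed hash: sum of (ASCII value * 1-based position) % TABLE_SIZE. */
static unsigned hash(const char *word)
{
    unsigned long sum = 0;
    size_t i;
    for (i = 0; word[i] != '\0'; i++)
        sum += (unsigned long)(unsigned char)word[i] * (i + 1);
    return (unsigned)(sum % TABLE_SIZE);
}

/* Tallies how many of the n words land in each cell, then reports
   the number of non-empty cells and the smallest and largest tallies. */
void chain_stats(const char *words[], size_t n,
                 size_t *used, size_t *shortest, size_t *longest)
{
    size_t counts[TABLE_SIZE] = {0};
    size_t i;

    for (i = 0; i < n; i++)
        counts[hash(words[i])]++;

    *used = 0;
    *shortest = counts[0];
    *longest = counts[0];
    for (i = 0; i < TABLE_SIZE; i++) {
        if (counts[i] > 0)
            (*used)++;
        if (counts[i] < *shortest)
            *shortest = counts[i];
        if (counts[i] > *longest)
            *longest = counts[i];
    }
}
```

A result of used == TABLE_SIZE with shortest and longest reasonably close together suggests the hash function is spreading the keys well; a constant hash such as hash(key) = 3 would report one used cell holding everything.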

For the record, my reference solution for Lab 3 took 10 seconds on unix.qatar to read the 42K file, but 8 minutes to read the 172K file! My reference solution for Lab 4 takes under a second to read the 172K file!

Makefiles

The programs we write in this course are small and do not really justify the use of makefiles. In later courses, however, they will be useful. Simply put, a makefile allows you to specify the interdependencies among various files and to recompile your program by recompiling only those files that have changed (or that depend on files that have changed) since the last compilation. The make utility compares the executable's last modification date with the modification dates of the .o files that it depends on. Likewise, those .o files' modification dates are compared against those of their respective source files. If any files need to be recompiled, they are, and the executable is rebuilt.
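The date comparison can be modeled in a few lines of C. This is a simplified sketch of the rule just described, not make's actual implementation; the function name needs_rebuild and its parameters are invented for the illustration. In a real program the times would come from a call such as stat().

```c
#include <stddef.h>
#include <time.h>

/* Returns 1 if the target must be rebuilt, 0 if it is up to date.
   target_exists: 0 if the target file is missing
   target_mtime:  the target's last modification time (ignored if missing)
   dep_mtimes:    last modification times of the n files it depends on */
int needs_rebuild(int target_exists, time_t target_mtime,
                  const time_t dep_mtimes[], size_t n)
{
    size_t i;
    if (!target_exists)
        return 1;                           /* nothing built yet */
    for (i = 0; i < n; i++)
        if (dep_mtimes[i] > target_mtime)   /* a dependency is newer */
            return 1;
    return 0;
}
```

The same test is applied at each level: first to every .o file against its source files, then to the executable against the .o files, so a change to one source file ripples upward only through the files that depend on it.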

The makefile is a dependency tree where each node is dependent on its children. At the root is the executable file that you wish to run. Its children are the object files that it depends on. The children of each object file are the source files it depends on. We'll run through an overview of the compilation process and makefiles and then look at an example, makeDemo.zip, that contains several files that comprise a program along with a makefile that builds the executable.

The files that comprise this project are:

makeDemo.c (contains main and includes arraylib.h)
arraylib.c (includes both .h files)
arraylib.h
intlib.c (includes intlib.h)
intlib.h

The direct dependency tree is:



                           /   makeDemo.c
               makeDemo.o
             /             \   arraylib.h

          /

       /                    /  arraylib.c
                          /
 demo  -  -  - arraylib.o - -  arraylib.h
                          \
       \                    \  intlib.h

          \
                           /   intlib.c
             \   intlib.o
                           \   intlib.h

Notice the hierarchy: the executable depends on the object (.o) files. Each object file depends on source (.c, .h) files. Further notice that the only files we include in a source file's dependency list are those source files that are needed to compile that source file. Thus arraylib.o depends on arraylib.c, arraylib.h and intlib.h. intlib.h is included in arraylib.c and thus arraylib.o depends on it.

We execute the commands in the Makefile by typing make. make will look first for a file called makefile and then Makefile for the commands to execute in building your executable (you can override this with the -f switch). If you haven't already done so, download the makeDemo.zip file and unzip it. Once you have the makeDemo directory, cd into it and type make. You should see something similar to the following:


$ cd makeDemo
$ make
gcc -Wall -pedantic   -c -o makeDemo.o makeDemo.c
gcc -Wall -pedantic   -c -o arraylib.o arraylib.c
gcc -Wall -pedantic   -c -o intlib.o intlib.c
gcc -Wall -pedantic -o demo makeDemo.o arraylib.o intlib.o -lm
$
$ ./demo
(the program will execute)

In the above excerpt we executed the Makefile by typing make in the same directory as the files. Note that all the .c files were compiled and the executable was built. We now want to demonstrate how the make utility only recompiles those files needed to build the project after one or more source files have been changed.

Let's now edit intlib.h and add a #define, then execute make again. Notice the result is that all the source files except makeDemo.c were recompiled because ultimately all the .o files except makeDemo.o depend on something that was affected by that change. Contrast this with what happens if we change intlib.c.

Lastly, let's change arraylib.h and execute the Makefile again. Note that this time every source file except intlib.c was recompiled. The Makefile did NOT recompile intlib.c because it is the only .c file that does NOT depend on (i.e., does NOT #include) arraylib.h.
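These experiments can be checked against the direct dependency tree with a small C sketch. The table below is copied from the tree shown earlier; the struct and the function must_recompile are hypothetical names used only for this illustration, and the sketch models the dependency lookup, not make itself.

```c
#include <stddef.h>
#include <string.h>

#define MAX_DEPS 3

/* One object file and the source files it directly depends on. */
struct ObjectFile {
    const char *name;
    const char *deps[MAX_DEPS];   /* unused slots are NULL */
};

/* Direct dependencies, copied from the dependency tree. */
static const struct ObjectFile objects[] = {
    { "makeDemo.o", { "makeDemo.c", "arraylib.h", NULL } },
    { "arraylib.o", { "arraylib.c", "arraylib.h", "intlib.h" } },
    { "intlib.o",   { "intlib.c",   "intlib.h",   NULL } }
};

/* Returns 1 if editing the file `changed` forces `object` to be
   recompiled, 0 otherwise (including when `object` is unknown). */
int must_recompile(const char *object, const char *changed)
{
    size_t i, j;
    for (i = 0; i < sizeof objects / sizeof objects[0]; i++) {
        if (strcmp(objects[i].name, object) != 0)
            continue;
        for (j = 0; j < MAX_DEPS && objects[i].deps[j] != NULL; j++)
            if (strcmp(objects[i].deps[j], changed) == 0)
                return 1;
    }
    return 0;
}
```

Under this table, changing intlib.h affects arraylib.o and intlib.o, changing arraylib.h affects makeDemo.o and arraylib.o, and changing intlib.c affects only intlib.o. In every case the executable demo is relinked afterward because at least one of its .o files is newer than it.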

This sample Makefile is as complex an example as we will likely need in this course. In fact, a simple one-liner that just recompiles all the files if any source is changed would be sufficient for the programs we write. However, when we encounter very large programs with many source files and hundreds of thousands of lines of code, the use of a makefile to keep track of dependencies is a necessity. A common example of such a scenario is compiling the Linux OS for installation on a microcomputer. Linux is open source and many experienced users prefer to download the source for the OS and compile it on their machine, rather than download a binary. Often a user will make changes to the source code to customize some aspect of the OS to their particular purpose. As these changes are being made, tested, and recompiled incrementally, the makefile prevents having to recompile ALL the code every time.