CSCI 241 - Homework 8: Huffman’s Algorithm
Due by 11:59.59pm, Wednesday, Dec 02
Introduction
For this assignment, you will create two programs (encode and decode) that perform the calculations needed for simple file compression. (For small files, it might make things a little bigger.)
Things to note
- The repository URL for this assignment is https://classroom.github.com/a/gee6Wfup
- This project is trickier than most. Get started on it early!
Program behavior
Your encode program should read a text file specified on the command line and write a Huffman encoded version of that file to the specified output file. Similarly, the decode program will read a file generated by encode and write a decoded version of that file to a specified output file.
If no output file is specified, write to stdout.
    % ./encode book.txt book.huf      # encodes book.txt and writes it to
                                      # book.huf
    % ./encode book.txt > book.huf    # encodes book.txt and writes it to
                                      # stdout (redirected to book.huf)
    % ./decode book.huf book.txt.2    # decodes the file, writing to book.txt.2
    % diff -q book.txt book.txt.2     # check to see if files are the same
                                      # should print nothing if they are
Getting started
You will likely need to divide your code into three parts, each stored in its own file:
- Functions needed by the encode program
- Functions needed by the decode program
- Functions used by both programs
Now, it is possible to have only a single program that can do both encoding and decoding based on the filename, but to handle that, you’d need to check the value of argv[0] and determine which function to perform. It’s probably easier to just make 2 separate programs.
Recall that you can make an object file by using the “-c” flag when compiling. Then you can link the various object files together to make actual programs.
Program Design
Encoding
In order to encode a file, you will first need to construct a Huffman tree based on the frequency of letters in the file. Your first step should be to read the file from start to end and calculate either absolute or relative frequency of all the characters encountered. You should include the frequency of EOF (which should always be 1) and we will store that at the beginning of our list of nodes (logical index of -1).
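The counting pass might look something like the sketch below. The helper name count_frequencies and the choice to shift every index up by one (so that slot 0 stands in for logical index -1) are assumptions made for illustration, not requirements of the assignment.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

#define NUM_SYMBOLS (UCHAR_MAX + 2)   /* every byte value plus one slot for EOF */

/* Hypothetical helper: tally every byte read from `in`.
 * counts[c + 1] holds the count for byte value c, so counts[0] is the
 * EOF slot (logical index -1). */
void count_frequencies(FILE *in, unsigned long counts[NUM_SYMBOLS])
{
    int c;  /* int, not char, so EOF stays distinct from byte 255 */

    assert(in != NULL);
    for (int i = 0; i < NUM_SYMBOLS; i++)
        counts[i] = 0;

    while ((c = fgetc(in)) != EOF)
        counts[c + 1]++;

    counts[0] = 1;  /* EOF always occurs exactly once */
}
```

Storing the result of fgetc() in an int rather than a char is what keeps a real 255 byte from being mistaken for the end of the file.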
You will then need to create a sorted list of nodes based on ascending frequency. To do this, I recommend that you use an insertion sort on a linked list. Insert new nodes starting with the index value of -1 and going up to index value of 255. Insert before items of equal value. Skip nodes with a frequency count of 0.
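One possible shape for the sorted insert is sketched below. The struct layout and the name insert_sorted are hypothetical. The point to notice is the strict less-than comparison, which is what places a new node before existing items of equal frequency.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node; a real one would also carry tree pointers. */
struct node {
    int value;              /* byte value, or -1 for EOF */
    unsigned long freq;     /* how often the value occurred */
    struct node *next;
};

/* Insert `n` into the list at *head, keeping ascending frequency order.
 * The walk stops at the first node whose frequency is >= n->freq, so the
 * new node lands before items of equal value. */
void insert_sorted(struct node **head, struct node *n)
{
    struct node **link = head;

    assert(head != NULL && n != NULL);
    while (*link != NULL && (*link)->freq < n->freq)
        link = &(*link)->next;

    n->next = *link;
    *link = n;
}
```

Walking a pointer-to-pointer avoids a special case for inserting at the head of the list.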
Then, you will need to convert this sorted list into a Huffman tree. While there are at least 2 nodes in your list (that is, until only the root remains), you should create a new node, attach the head item in the list as the left child, the second item in the list as the right child, update the frequency count for this new node, and insert it into your linked list. Be sure you’ve removed the two nodes that are now children before re-inserting.
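The merge loop could be sketched as follows. The names build_tree and insert_sorted and the node layout are assumptions for illustration; the loop runs until a single node (the root) is left, which matches the five passes shown in the sample run for a six-node list.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical node usable in both the list and the tree. */
struct node {
    int value;                    /* unused for internal nodes */
    unsigned long freq;
    struct node *left, *right;    /* tree links */
    struct node *next;            /* list link */
};

/* Ascending insert; new nodes go before items of equal frequency. */
void insert_sorted(struct node **head, struct node *n)
{
    struct node **link = head;
    while (*link != NULL && (*link)->freq < n->freq)
        link = &(*link)->next;
    n->next = *link;
    *link = n;
}

/* Repeatedly merge the two lowest-frequency nodes; returns the root. */
struct node *build_tree(struct node *head)
{
    assert(head != NULL);
    while (head->next != NULL) {
        struct node *a = head;
        struct node *b = head->next;
        struct node *parent = malloc(sizeof *parent);
        if (parent == NULL) {
            fprintf(stderr, "build_tree: out of memory\n");
            exit(EXIT_FAILURE);
        }
        head = b->next;               /* remove both children from the list */
        a->next = b->next = NULL;

        parent->value = 0;
        parent->freq = a->freq + b->freq;
        parent->left = a;
        parent->right = b;
        parent->next = NULL;
        insert_sorted(&head, parent);
    }
    return head;
}
```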
Now traverse the tree keeping track of the string needed to reach that node based on using a character ‘0’ for a left branch and ‘1’ for a right branch. When you reach a leaf node, you will know what string is to be used to represent that character.
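A recursive traversal is one way to collect the strings. In this sketch the table layout (codes[v + 1] for value v, so EOF lands in slot 0), the MAX_CODE_LEN bound, and the function name assign_codes are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* With at most 257 leaves the tree is at most 256 levels deep,
 * so 256 characters plus the terminating NUL always fit. */
#define MAX_CODE_LEN 257

struct node {
    int value;                    /* byte value, or -1 for EOF */
    struct node *left, *right;    /* both NULL on a leaf */
};

/* Record the path to every leaf below `n`. `path` is scratch space of
 * MAX_CODE_LEN chars and `depth` is the number of branches taken so far. */
void assign_codes(const struct node *n, char *path, size_t depth,
                  char codes[][MAX_CODE_LEN])
{
    assert(n != NULL && depth < MAX_CODE_LEN);
    if (n->left == NULL && n->right == NULL) {
        path[depth] = '\0';
        strcpy(codes[n->value + 1], path);
        return;
    }
    path[depth] = '0';                          /* left branch */
    assign_codes(n->left, path, depth + 1, codes);
    path[depth] = '1';                          /* right branch */
    assign_codes(n->right, path, depth + 1, codes);
}
```

A first call would pass the root, a scratch buffer, and a depth of 0.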
Now re-read the input file from the beginning and for each letter encountered, print the bit string that corresponds to that character. Be sure to output the string corresponding to the EOF notice too and stop after you do so.
File Format
The files to be encoded can be treated as simple 8-bit character files (but use CHAR_BIT instead of 8). What I mean by this is that if you call fgetc() you will get a character until you reach the end, at which point you will have EOF returned (which we’ll treat as if it has a character value of -1 – which it should). By treating these files as byte-oriented rather than printable ASCII, you should be able to encode both text and binary files.
The Huffman encoded output files will be a bit different. You need to include the binary trie representing the Huffman prefix codes. To do this, you will do a pre-order traversal of the trie using a 0-bit to indicate that it is an internal node and therefore has left and right children, or a 1-bit to indicate that it is a leaf node. Immediately following the 1-bit you will write the CHAR_BIT bits from most to least significant that make up the value of the character at that location in the tree.
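To make the layout concrete, the sketch below produces the pre-order description as a string of '0' and '1' characters rather than packed bits, which is easier to compare against the sample run. The name describe_tree is hypothetical, and a real encoder would send each bit to its bit-buffering routine instead of a string.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

struct node {
    int value;                    /* byte value, or -1 for EOF */
    struct node *left, *right;    /* both NULL on a leaf */
};

/* Append the pre-order description of the trie rooted at `n` to `out`,
 * starting at offset `len`; returns the new length. The caller must
 * provide room for every node: 1 char per internal node, 1 + CHAR_BIT
 * per leaf, plus the terminating NUL. */
size_t describe_tree(const struct node *n, char *out, size_t len)
{
    assert(n != NULL && out != NULL);
    if (n->left == NULL && n->right == NULL) {
        out[len++] = '1';                       /* leaf marker */
        /* Low CHAR_BIT bits of the value, most significant first.
         * EOF (-1) therefore comes out as all 1 bits. */
        for (int bit = CHAR_BIT - 1; bit >= 0; bit--)
            out[len++] = (((unsigned)n->value >> bit) & 1u) ? '1' : '0';
    } else {
        out[len++] = '0';                       /* internal node marker */
        len = describe_tree(n->left, out, len);
        len = describe_tree(n->right, out, len);
    }
    out[len] = '\0';
    return len;
}
```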
Immediately following the pre-order traversal of the tree, you will write an initial bit-string representing EOF. (You will later use that to find the leaf that represents EOF instead of 255 and correct the value there.)
After the table, you should output the individual bits that are needed to
represent the input file. You’ll have to buffer the bits until you get
CHAR_BIT of them and then output it. (The most significant bit is the first
bit, and then they progress downward.)
Hint: you might want to look at what you wrote for encode_bits and decode_bits in homework 4.
Pad out the last incomplete character in the file with 0 bits. If you write (CHAR_BIT-1) 0-bits out then it will flush any remaining bits without creating a new character.
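A small bit-buffering writer along these lines is one way to handle the output. The struct and function names (bit_writer, put_bit, put_code, flush_bits) are hypothetical; the flush uses the (CHAR_BIT-1) zero-bit trick described above.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical writer state: bits accumulate MSB-first in `buffer`. */
struct bit_writer {
    FILE *out;
    unsigned char buffer;   /* bits collected so far */
    int count;              /* number of valid bits in buffer */
};

/* Queue one bit; emit a character once CHAR_BIT bits are collected. */
void put_bit(struct bit_writer *w, int bit)
{
    assert(w != NULL && w->out != NULL);
    w->buffer = (unsigned char)((w->buffer << 1) | (bit & 1));
    if (++w->count == CHAR_BIT) {
        fputc(w->buffer, w->out);
        w->buffer = 0;
        w->count = 0;
    }
}

/* Write each '0'/'1' character of a code string as one bit. */
void put_code(struct bit_writer *w, const char *code)
{
    for (; *code != '\0'; code++)
        put_bit(w, *code == '1');
}

/* Pad with CHAR_BIT-1 zero bits: this pushes out a partial character
 * if there is one, and can never complete an extra character. */
void flush_bits(struct bit_writer *w)
{
    for (int i = 0; i < CHAR_BIT - 1; i++)
        put_bit(w, 0);
    w->buffer = 0;          /* discard leftover padding bits */
    w->count = 0;
}
```

Feeding the tree description, the EOF string, and the codes from the sample run through put_code() and then calling flush_bits() should reproduce the bytes that xxd shows for the sample, assuming CHAR_BIT is 8.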
Decoding
To decode the file, you should first open the file specified on the command line. You then can read in the pre-order traversal of the tree, assembling it as you go. A 0-bit indicates an internal node which has both a left and right child. A 1-bit indicates it is a leaf and the next CHAR_BIT bits represent the value at that node from MSB to LSB. (I found a recursive function to work nicely for this.)
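A recursive reader might look like the sketch below. The names bit_reader, get_bit, and read_tree are assumptions, and treating a premature end of file as a fatal error is a design choice made here because a well-formed encoded file should never run out of bits.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *left, *right;    /* both NULL on a leaf */
};

/* Hypothetical reader state: hands out bits MSB-first. */
struct bit_reader {
    FILE *in;
    int buffer;     /* last character read */
    int count;      /* bits of buffer not yet handed out */
};

/* Return the next bit; running out of input is treated as fatal. */
int get_bit(struct bit_reader *r)
{
    assert(r != NULL && r->in != NULL);
    if (r->count == 0) {
        r->buffer = fgetc(r->in);
        if (r->buffer == EOF) {
            fprintf(stderr, "unexpected end of encoded file\n");
            exit(EXIT_FAILURE);
        }
        r->count = CHAR_BIT;
    }
    r->count--;
    return (r->buffer >> r->count) & 1;
}

/* Rebuild the trie from its pre-order description. */
struct node *read_tree(struct bit_reader *r)
{
    struct node *n = malloc(sizeof *n);
    if (n == NULL) {
        fprintf(stderr, "read_tree: out of memory\n");
        exit(EXIT_FAILURE);
    }
    n->value = 0;
    n->left = n->right = NULL;
    if (get_bit(r) == 1) {                      /* leaf: value follows */
        for (int i = 0; i < CHAR_BIT; i++)
            n->value = (n->value << 1) | get_bit(r);
    } else {                                    /* internal: two subtrees */
        n->left = read_tree(r);
        n->right = read_tree(r);
    }
    return n;
}
```

Note that the EOF leaf comes back with all bits set (255 when CHAR_BIT is 8); it still has to be corrected using the EOF bit string that follows the tree.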
Now you need to fix the value of EOF in the tree. Switch over to a bitwise read/tree traversal routine where 0 indicates to go left and 1 indicates to go right. Once you hit the first leaf, you now have the location for the actual EOF marker and you should update the value there accordingly.
Now you continue with a bitwise read/tree traversal routine and use those to determine if you should go left on 0 or right on 1 in the tree. Once you reach a leaf, you should be at a letter. Print it and move back to the root. When you reach the EOF marker you should stop reading/printing and close both files. Nothing is printed for the EOF marker.
NOTE: You should not print out anything when you reach the EOF marker, and you should never reach the actual end of the encoded file.
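The two traversal steps (locating the EOF leaf, then decoding the body) could be sketched as below, assuming the tree has already been rebuilt. The names walk_to_leaf and decode_body are hypothetical, as is the bit reader. Remembering the EOF leaf by pointer as well as updating its value is a choice made for this sketch.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *left, *right;    /* both NULL on a leaf */
};

struct bit_reader {
    FILE *in;
    int buffer;     /* last character read */
    int count;      /* bits of buffer not yet handed out */
};

/* Return the next bit, MSB first; running out of input is fatal. */
int get_bit(struct bit_reader *r)
{
    if (r->count == 0) {
        r->buffer = fgetc(r->in);
        if (r->buffer == EOF) {
            fprintf(stderr, "unexpected end of encoded file\n");
            exit(EXIT_FAILURE);
        }
        r->count = CHAR_BIT;
    }
    r->count--;
    return (r->buffer >> r->count) & 1;
}

/* Follow bits from the root: 0 goes left, 1 goes right, stop at a leaf. */
struct node *walk_to_leaf(struct node *root, struct bit_reader *r)
{
    struct node *n = root;
    assert(root != NULL && r != NULL);
    while (n->left != NULL)       /* internal nodes have both children */
        n = get_bit(r) ? n->right : n->left;
    return n;
}

/* Fix up the EOF leaf, then print characters until EOF is decoded. */
void decode_body(struct node *root, struct bit_reader *r, FILE *out)
{
    struct node *eof_leaf = walk_to_leaf(root, r);
    eof_leaf->value = -1;         /* correct the placeholder value */

    for (;;) {
        struct node *leaf = walk_to_leaf(root, r);
        if (leaf == eof_leaf)
            break;                /* nothing is printed for EOF */
        fputc(leaf->value, out);
    }
}
```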
Sample run – with internal state
INPUT: cheese

Frequency Counts:
     -1 EOF   1
     10 \n    1
     99 c     1
    101 e     3
    104 h     1
    115 s     1

Linked List (initial):
    s(1) -> h(1) -> c(1) -> \n(1) -> EOF(1) -> e(3)

First pass:
    c(1) -> \n(1) -> EOF(1) -> (2) -> e(3)
                               / \
                            s(1) h(1)

Second pass:
    EOF(1) -> (2) --> (2) -> e(3)
              / \     / \
          c(1) \n(1) s(1) h(1)

Third pass:
    (2) ------> (3) ---> e(3)
    / \         / \
 s(1) h(1)  EOF(1) (2)
                   / \
                c(1) \n(1)

Fourth pass:
    e(3) -----------> (5)
                    /     \
                (2)         (3)
                / \         / \
             s(1) h(1)  EOF(1) (2)
                               / \
                            c(1) \n(1)

Fifth (and final) pass:
                 (8)
               /     \
           e(3)       (5)
                    /     \
                (2)         (3)
                / \         / \
             s(1) h(1)  EOF(1) (2)
                               / \
                            c(1) \n(1)

Internal data: (including padding)
    char      count  bitstring
    ----      -----  ---------
     -1 EOF   1      110
     10 \n    1      1111
     99 c     1      1110
    101 e     3      0
    104 h     1      101
    115 s     1      100

Tree: (with added spaces for clarity)
    0 1 01100101 0 0 1 01110011 1 01101000 0 1 11111111 0 1 01100011 1 00001010

EOF:
    110

Remainder of file: (spaces added, includes EOF)
    1110 101 0 0 100 0 1111 110

Remainder is padding to make it a full char:
    000000
You can also use some Unix tools to examine your output files:
File passed through xxd:
    0000000: 594b 9da1 ff58 e15b a91f 80              YK...X.[...

File passed through xxd -b: (bits)
    0000000: 01011001 01001011 10011101 10100001 11111111 01011000  YK...X
    0000006: 11100001 01011011 10101001 00011111 10000000           .[...
Design Ideas
You’ll need to be dynamically creating nodes, so malloc() and free() are your friends. Be sure to free() all the allocated data once you are done with it, and fclose() all files you opened. Valgrind should report that there were no memory leaks.
You might want to create a node struct that can be used in both a linked list and a tree simultaneously. So, you’ll want to have both “left” and “right” pointers as well as a “next” pointer.
You can create an array that has 256-buckets for your counts and hard-code the fact that your EOF node has a frequency of 1. This is the most straightforward approach.
If you want, you can create an array that has a valid position at index -1 by dynamically allocating an array of N items and then setting a pointer to the address of the second item in the array. If you then use the pointer in an array context, you can go from index -1 to N-2. Just remember that you need to free() from the actual start of the array.
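The offset-pointer idea can be sketched as follows. The helper names alloc_counts and free_counts are hypothetical; the returned pointer is one element past the start of the allocation, which is what makes index -1 legal.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define NUM_SYMBOLS (UCHAR_MAX + 2)   /* EOF slot plus every byte value */

/* Allocate a zeroed count table and return a pointer one element past
 * its start, so valid indices run from -1 (EOF) to UCHAR_MAX.
 * Returns NULL if the allocation fails. */
unsigned long *alloc_counts(void)
{
    unsigned long *base = calloc(NUM_SYMBOLS, sizeof *base);
    if (base == NULL)
        return NULL;
    return base + 1;
}

/* Release a table obtained from alloc_counts(). */
void free_counts(unsigned long *counts)
{
    assert(counts != NULL);
    free(counts - 1);   /* free() needs the address calloc() returned */
}
```

With this in place, counts[-1] is the EOF slot and counts[c] can be indexed directly with the value returned by fgetc().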
You should print a message and exit if you attempt to malloc something and it fails. Rather than just cutting and pasting this throughout your code, why not write a function that does the malloc(), the check for failure, and perhaps some initialization.
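A wrapper of that kind might look like the sketch below. The names xmalloc, new_node, and free_tree, and the node layout, are assumptions for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical node usable in both the linked list and the tree. */
struct node {
    int value;                    /* byte value, or -1 for EOF */
    unsigned long freq;
    struct node *left, *right;    /* tree links */
    struct node *next;            /* list link */
};

/* malloc() that prints a message and exits instead of returning NULL. */
void *xmalloc(size_t size)
{
    void *p;

    assert(size > 0);   /* malloc(0) may legitimately return NULL */
    p = malloc(size);
    if (p == NULL) {
        fprintf(stderr, "out of memory (%zu bytes requested)\n", size);
        exit(EXIT_FAILURE);
    }
    return p;
}

/* Allocate a node and initialize every field. */
struct node *new_node(int value, unsigned long freq)
{
    struct node *n = xmalloc(sizeof *n);
    n->value = value;
    n->freq = freq;
    n->left = n->right = n->next = NULL;
    return n;
}

/* Free a whole tree, children before parents. */
void free_tree(struct node *n)
{
    if (n == NULL)
        return;
    free_tree(n->left);
    free_tree(n->right);
    free(n);
}
```

Because new_node() sets every pointer to NULL, leaf tests such as checking for a NULL left child are safe on freshly created nodes.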
There are also sample binaries for you to play with in ~rhoyle/pub/cs241/hw08/
handin
README
Create a file called README that contains
- Your name and partner's name (if any)
- A description of the programs
- A listing of the files with a short one line description of the contents
- Any known bugs or incomplete functions
- An estimate of the amount of time you spent completing this assignment
- Any interesting design decisions you'd like to share
- Describe any unresolved warnings that are generated by valgrind and what you believe them to be caused by.
Now you should make clean to get rid of your executables/object files and handin your folder containing your source files, Makefile, and README.
Extra Credit
In this compression algorithm, we are looking at single characters to create our encoding tree. For extra credit, experiment with 2-3 character sequences, and see if they create a better compression tree.
Grading
Here is what I am looking for in this assignment:
- A working Makefile with your program, all, and clean as targets
- A program that will encode files using Huffman's algorithm as described above.
- A program that will decode files encoded with Huffman's algorithm as described above.
- An internal linked-list representation using structs
- Output matching the sample program
- Appropriately modular code
- Good comments
- Runs under valgrind with no errors or warnings
- Man pages for each program
- A README with the information requested above. The listing of known bugs is important.
Last Modified: November 02, 2022 - Roberto Hoyle from material created by Ben Kuperman