Chris Mills
Twitter: @MisterXopher
make
./prog compressed_file.Z
printf '000I\236\f\31H\260\240\301\203' > ioccc.Z
./prog ioccc.Z
dd bs=1k count=4096 < /dev/urandom > test1
compress < test1 > test1.Z
time ./prog test1.Z | cmp - test1
dd bs=1k count=4096 < /dev/zero > test2
compress < test2 > test2.Z
time ./prog test2.Z | cmp - test2
Without ASCII art, the source code of this entry would have been exactly 256 bytes. Compared to 1990/jaw, it is very impressive, even taking into account the missing atob functionality. Now we wish for a compress(1)-like compressor using a similar idea.
Why is the sad smiley on line 12 in that particular place?
This program is in implementation of the unix zcat(1)
command for printing
compressed files that have been created by the compress(1)
command.
The adaptive LZW compression
employed by compress
usually requires a fairly large amount of memory to
build the decompression tables (in fact, compress
has a command-line option
to control the size of the decompression tables, since some of the early
systems to which it was ported (coughMS-DOScough) did not have enough
memory to use the full 16-bit codes). This version has no large static or
dynamic tables. It is able to decompress 16-bit files using a single function
and a few ints.
To run the program, take an existing compressed file and pass it to the program
on the command line. On the odd chance you don’t have compress
hanging about
on your system, I’ve provided compressed copies of this year’s IOCCC rules
and guidelines files. To print them, just do
./prog ioccc_guidelines.txt.Z
The basic cleverness here comes from realizing that the compressed data is itself an encoded version of the code dictionary. When the encoder sees an input string that is not in its dictionary, it emits the code corresponding to the longest prefix that is in the dictionary and adds a new entry consisting of that code followed by the next input character. This means that when the decoder wants to know how to decode a code, it just needs to find the spot in the code-stream where the encoder added that entry to the dictionary. The code at that spot is the prefix, and the first character of the following code is the suffix. Decoding then just becomes recursion, to follow the chain of prefixes and suffix characters.
Of course, it’s not nearly that simple. Finding the code in the stream is complicated by several issues:
compress
encoder reserves a code word that clears the coding tables.
When this is encountered, the computation of offsets and bit sizes reset.compress
decoder is designed to always read a block of eight code
words at the same time, so that it need only read N bytes to get eight N-bit
code words. When the “clear” code is encountered, the entire eight-word
block is flushed (since the decoder didn’t know the word size would change).
The program needs to account for this extra flushing.Managing all the details involved added a lovely additional layer of obfuscation to the otherwise elegant recursive decoder. Still, the code manages it in a remarkably small amount of code.
The decoder doesn’t look at the header. If you supply something that is not compressed data, it’s likely to crash in an interesting manner.
In particular, it doesn’t look at the third byte of the header, which encodes
the maximum code word size and whether “block mode” was used when the encoder
was run. If you use compress
in its default mode, the maximum word size
will be 16 bits and “block mode” will be enabled. You are unlikely to find
any files that don’t use “block mode”, since all versions of compress
after
2.0 (circa 1984) use it by default and have no documented way to disable it.
Reducing the bit depth is still possibly by using the -b
flag. Modern
systems have no problem with the 200 KB or so needed for 16-bit decoder
tables, so the need for reducing the table size has passed, hence this program
always assumes 16-bits… However, should you find a file that needs a
smaller table, feel free to change the hard-coded 16 in the source into the
appropriate number and recompile.
In the event that you don’t supply an argument, or that argument doesn’t name a readable file, the program will just silently exit. This is a feature.
The program depends on having a little-endian system. On a big-endian system
you will get a wierd big-endian version of compress
, which might have come to
pass in an alternate universe where the vax on which compress
was authored
had been big-endian. Luckily for us, it wasn’t, and neither are most of the
systems you might want to compile this for.
© Copyright 1984-2016,
Leo Broukhis, Simon Cooper, Landon Curt Noll
- All rights reserved |