Kimmo Fredriksson
kimmo.k.k.fredriksson@gmail.com
make fredriksson
./fredriksson [-icvtnk#] regexp < file
./fredriksson -k1 -i POOT < /etc/group
ulimit -s unlimited
./fredriksson -k2 -t -n bofuskate < /usr/share/dict/words
The author of this entry was making sure that it does not win the “Best Short Program” award. On the outside, the program behaves as an approximate grep. On the inside, it is a Burrows-Wheeler transform decompressor that produces the source code of the actual obfuscated approximate grep program and calls it. As an exercise, try writing the compressor.
Implements a variant of grep.
To search regexp from file, do:
./fredriksson [-icvtnk#] regexp < file
Pipes may not work, see the section Limitations.
The options -i -c -v
and -n
are the classic ones:
-i
Ignore case.
-c
Count only, prints the number of matches, but not the matching lines.
-v
Invert match.
-n
Prefixes each printed line by its line number (only if
-c
was not given).
This version of grep recognizes two more options, not present in normal grep implementations:
-kN
Permit N
insertions, deletions or mismatches of characters in the
matches (aka edit-distance). E.g. “obfuscated” matches “obfuscation”
with one deletion and two mismatches. Likewise, “obfuscate” and
“oversimplify” match with 7 edit operations. The default is -k0
(i.e. exact match).
-t
Add local transpositions to the set of allowed edit-operations.
That is, “ab” matches “ba” with one transposition (swap), and
“obfuscated” matches “bofsuact” with three swaps and two insertions,
i.e. use -tk5
to find this.
Of course, one could construct a standard regexp that matches the same patterns, but the problem is that such a regexp grows exponentially in size when the number of allowed edit operations is increased.
This version does not recognize all regular expressions, but the following are allowed:
Wild card: .
(dot) matches any character.
Bracket expression: list of charactres enclosed by [
and ]
,
matches any single character in the list.
Range expression: two characters separated by hyphen and enclosed in
brackets, matches any character that is lexicographically between the
two characters (inclusive). E.g. [a-d]
is the same as [abcd]
.
The special characters must be escaped, if they are to be taken
literally (i.e. use e.g. \.
to match period). You might want to protect
these with '
, i.e. use 'foo\[bar'
instead of foo\[bar
. If you want to
use the literal -
in bracket expression, protect that too (\-
),
otherwise this is not needed. Note that the syntax differs from standard
(?) grep a bit, i.e. in plain grep you can use ]
and -
in bracket
expressions by putting ]
as the first character, and putting -
last.
You can use this in place of cat. Just say
./fredriksson -k3 cat < file
Like with cat, you can append line numbers too:
./fredriksson -nk3 cat < file
It is assumed that the user knows how to feed correct input to the program. If undefined options are given, the result is also undefined. This is a feature.
It may consume huge amounts of memory, and a stack overflow may occur.
This results in Segmentation fault (or the equivalent of your favorite
system, such as Bus error). This is a feature, that can be used to detect
overly long lines in the input file, such as checking the lenght of one
liners in IOCCC entries, just adjust your stack size to a suitable
threshold, and search for “ioccc”, with option -vk5
.
Using warning options (such as -Wall) when building gives a lot of warning,
such as suggest parentheses around '&&' within '||'
, value computed is
not used
, implicit declaration of function 'putchar'
, implicit
declaration of function 'getchar'
, control reaches end of non-void
function
and format not a string literal and no format arguments
.
Please ignore them all, since they just try to tell you that the source
is obfuscated.
Only standard input is handled. And only if stdin comes from a file,
i.e. pipes may or may not work. If you want to count only (-c
option),
then this shouldn’t be an issue.
The total lenght of the (preprocessed) pattern is at most
sizeof(long)*CHAR_BIT
. That does not count things like bracket expressions
or escape chars, that is, the pattern 'a[bcd]e\.f'
is 5 chars long, not
10 chars. That does not mean that the matched substring in the standard
input has this limit, as using the edit operations can make the string
longer. However, no assumptions are made for sizeof(long)
. In fact,
you can easily use long long
instead,
Depending on the file, may require a lot of stack space. If the program segfaults (because of this), say
ulimit -s unlimited
cross your fingers, hope it is enough, and try again.
Assumes C99 standard, because of the one very long line. Most C compilers handle this just fine even in C89 mode, though.
The code should be reasonably portable, it assumes standard ASCII codes for each char, and that a C compiler is on $PATH; the latter is hard coded to be ‘gcc’, but any compiler would work, provided that it supports C99, or long lines as an extension. This code has been tested on:
(Btw, tcc has its roots in IOCCC…)
The initialized char array includes the source code of the real program, as well as its name and commands to build and run it, all Burrows-Wheeler -transformed, then run-length-encoded with unary coding. The main program extracts and builds that, and then calls it. More precicely, the command to build is
gcc -O3 ag.c -o ag
To change that you need to edit the char array p[]
and (possibly) two
specially encoded constants, -~('('*'(')
and 'C'*' '-'-'
.
The extracted source (ag.c
) is obfuscated as well.
(The original version also deleted the generated source and the corresponding binary after running it, but this feature was removed from the submitted entry.)
The code demonstrates several useful programming paradigms, such as
main()
contains several logical sub routines,
selected by the first argument)Algorithmically, it incorporates the following
As a C program, it demonstrates that
int
, char
, void
…))It also shows the power of boolean logic and bitwise arithmetic.
This is the first (?) entry in the history of IOCCC, that cannot be totally un-obfuscated by the author. This is due to the clever choice of the actual algorithm, that is obfucated by nature, and it is an open research problem whether a simpler algorithm exists, with the same time complexity.
The main obfuscation comes from using only arithmetic and boolean
(short-circuit) logic combined with recursion to remove all loops and
if-then-else constructs (also, no ternary ?:
constructs are present).
Two’s complement representation of integers is also abused. E.g. instead
of writing x++
or x-1
, it is much better to write x=-~x
and ~-x
.
No numerical values, other than in the initialized char array, try
./fredriksson -ck0 [0-9] < fredriksson.c
This should print 1, as there is a numer ‘3’ in the char array (only). You can also try
./fredriksson -ck0 [0-9] < ag.c
after running fredriksson at least once (since this generates that
ag.c
file), and this should print 0.
The source code is best viewed with cat (or using the program itself), instead of some editor. (This is a somewhat system dependent thing.)
One reason for this is that the code builds (some of the) constant
values from string literal "\b"
, except that the backspace \b
is not
escaped like that, the source uses the raw ascii value (010 octal) instead.
Thus when viewing with cat or something, that backspace erases the leading
quotation mark. Makes it appear uncompilable, and it is: if you copy-paste
the cat output from terminal to some editor, save it, the result would not
compile.
Some other constants are built from ASCII values of chars.
Besides of making the code hard to read, all this makes it also quite large.
Running it through GNU indent or bcpp (C beautifiers) does not beautify
it a bit. In fact, indent makes it only worse (at least with the default
options). Running it through the preprocessor does basically nothing,
since no #define
s are used.
Some abuse of the rules. The number of non-whitespace and ;
, {
and }
chars is below the limit, but most of the whitespace chars are
in an initialized character array. That is, the number of chars in the
program exceeds the 2048 limit. However, whitespace is whitespace and
the rules did not specify that these must be in between the statements.
A character is a whitespace if isspace()
returns non-zero value. Even
if deleting even one of those chars breaks the program! (This does not
apply to the generated source ag.c
.)
© Copyright 1984-2012,
Leo Broukhis, Simon Cooper, Landon Curt Noll
- All rights reserved |