January 2010

Ever tried to learn a language, get fit, or write a book? Spent a considerable amount on a linguaphone course or gym membership? You felt a high when you spent the money and thought that you were halfway to your goal, only to find that a month or two later you’re no nearer to success than you were at first – except your wallet is now much lighter?

Well, you’re not alone.

Other similar projects that I’ve seen can include inventing new designs for some embedded computer hardware such as an Arduino, or developing a web site and purchasing the server before writing any content.

What was your project?

Here’s a neat explanation of why there can never be a single algorithm that can compress all files, even if only by one bit each.

  1. Take the set of all files of 10 bits length. There are 1024 files. Call this Set 1024.
  2. Now take the set of all files of 9 bits length. It is one half the size of the previous set, with 512 files. Call this Set 512.
  3. Let’s take a compression algorithm and compress all the files in Set 1024. Each output file must be found in Set 512, as it is by definition compressed and of 9 bits or less.
  4. But note, that’s a mapping of a set of 1024 files to a set of 512 files. There must be at least one output file in Set 512 which is mapped from two or more input files in Set 1024. Call this File A. So, when we decompress File A from Set 512, which of the files in Set 1024 that it is mapped from should it return?


There’s a wrinkle here that I didn’t appreciate when I first wrote this. I should have used the word string rather than file. The problem with using files is that in many cases the same compressed output can be stored differently, by using different length files. E.g. the output 000100101 could be stored as a 9 bit file, or 8 bit file, or 7 or 6 bit file! So the output set, when using files, has a total space of 2 + 4 + 8 + 16 + 32 + 64 + 128 + 256 + 512 = 1022, rather than 512. However, the basic explanation still holds, because 1022 < 1024, so some output files must map back to more than one input file.

For my web site arabicreader.net I need to dynamically create pdf files that contain the users arabic vocabulary that they wish to revise.

This was done using python and some libraries that I created some time ago to output unicode arabic to postscript. I dynamically create the postscript file (postscript is a language of its own and it’s well worth spending a few hours reading the official developers documentation from Adobe) and then convert it to pdf using ps2pdf. I instruct ps2pdf to embed the font ( an old free type1 arabic font) so that the pdf is readable by anyone even if they do not have that font available.

The code to convert unicode arabic to postscript involves shaping the arabic (getting the correct glyph for a character according to the characters before and after it) and creating an encoding that maps the glyphs in the font to the character codes that I output. It also calculates the length of the resulting shaped text so that it can be placed, right-to-left and aligned to the right, at the correct x,y coordinate.

Also, because learners of arabic need to learn the tashkeel, the algorithm also places the harakaat nicely above and below the glyphs according to the size of the glyph. Without this the harakaat tend to overlay the glyph and are unreadable. I do this by stripping the harakaat out of the word before outputing it, and then adding each harakat in a second sweep, taking into account the dimensions of the glyph it is over.

I will be releasing the code as open-source but it’s not yet published. Contact me if you need it now and I’ll email it to you.