17 Jan 2008 roozbeh   » (Master)

Bit manipulation: Ages ago, in October 2005, Federico asked for improvements to a certain g_utf8_offset_to_pointer() function.

This resulted in an optimization match among various people, which Behdad has more or less summarized here (read the comments too).

Fast forward to December 2006, when I was going over the new Unicode book and trying to make sure Gnome and friends were Unicode compliant. One of the bugs I filed was this one, and some of the answers I received discouraged me enough that I stopped the whole effort. (The bug is about getting rid of legacy support for an old version of UTF-8 which the Unicode Standard now considers a security problem.)

Then, last month I was reading some draft material Donald Knuth is putting online for his infamous Volume 4 of The Art of Computer Programming. One of the pre-fascicles he has put online is about Bitwise Tricks and Techniques, which I really enjoyed reading. Knuth, being a Unicode fan, has included some interesting exercises on UTF-8 and UTF-16.

One of the exercises included a magic (!) formula to replace the utf8_skip_data array (see Federico's post again). It is provided in exercise 197.
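For readers who have not looked at the array: as far as I understand it, it maps the first byte of a UTF-8 sequence to the length of that sequence, so that a function like g_utf8_offset_to_pointer() can hop from character to character. Here is a rough Python sketch of the idea. The names skip_table and offset_to_index are made up for illustration, the table is rebuilt from lead-byte ranges rather than copied from glib, and the input is assumed to be valid UTF-8 with the offset in range.

```python
# Hypothetical reconstruction of a skip table:
# index = first byte of a sequence, value = length of that sequence in bytes.
# Lengths 5 and 6 are the legacy forms that proper UTF-8 no longer allows.
skip_table = (
    [1] * 192    # 0x00-0xBF: ASCII and continuation bytes
    + [2] * 32   # 0xC0-0xDF
    + [3] * 16   # 0xE0-0xEF
    + [4] * 8    # 0xF0-0xF7
    + [5] * 4    # 0xF8-0xFB (legacy)
    + [6] * 2    # 0xFC-0xFD (legacy)
    + [1] * 2    # 0xFE-0xFF: never valid, skip a single byte
)

def offset_to_index(data, offset):
    """Return the byte index of the character number `offset` in `data` (bytes)."""
    i = 0
    for _ in range(offset):
        i += skip_table[data[i]]  # one memory reference per character
    return i

# 'a' (1 byte), e-acute (2), euro sign (3), an emoji (4), then 'z'
text = "a\u00e9\u20ac\U0001F600z".encode("utf-8")
print(offset_to_index(text, 4))  # byte index where 'z' starts
```

The table lookup inside the loop is the memory reference that a formula would replace.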

Knuth's formula not only needs no memory reference, it's also branch-free (which is considered very good for many modern CPU architectures). The formula does it with four operations, which would become five when adapted to the present formulation used in glib. The only problem is that it only works for proper UTF-8, the version the Unicode Standard requires, but not glib's UTF-8.

I tried to extend Knuth's formula to glib's UTF-8, and did it on paper with two more operations (seven instead of five), using 64-bit boolean arithmetic.

When I chatted with Behdad, he told me it's not really worth replacing the array with the formula (I don't understand the reasons well enough to explain them here, but I trust him), but he was interested in seeing my extended formula.

So last night, I tried to make sure my formula works fine before emailing it to Behdad. And I found a bug, which meant that I needed to add two more operations to get it done properly, for a total of nine operations.

This is my new formula, which is tested and works fine. It may not provide exactly the same results as the utf8_skip_data array for all values, but many of the array's cells are redundant. For necessary cells, it provides the same results:

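def utf8_skipper(c):
  # c is the first byte of a sequence (0..255); t is its top seven bits, inverted
  t = (c >> 1) ^ 0x7F
  # t & ~(t >> 1) keeps each 1 bit of t whose higher neighbour is 0;
  # times 3, it selects a 3-bit field of the constant holding (length - 1)
  return ((0x924900009201B128 >> ((t & ~(t >> 1)) * 3)) & 7) + 1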
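One way to check the formula is to compare it, for all 256 byte values, against a slow and branchy definition: count the leading 1 bits of the byte. The sketch below does that. The name reference_skip is invented here, and its convention that continuation bytes and 0xFE/0xFF count as length 1 is my assumption about what the redundant cells should hold.

```python
def utf8_skipper(c):
    t = (c >> 1) ^ 0x7F
    return ((0x924900009201B128 >> ((t & ~(t >> 1)) * 3)) & 7) + 1

def reference_skip(c):
    """Slow, branchy sequence length for first byte c (legacy 1..6 byte forms)."""
    ones = 0
    mask = 0x80
    while mask and (c & mask):  # count leading 1 bits
        ones += 1
        mask >>= 1
    # 0 leading ones: ASCII. 1: continuation byte. 7 or 8: 0xFE/0xFF.
    # All of these are assumed to mean "skip one byte".
    return ones if 2 <= ones <= 6 else 1

mismatches = [c for c in range(256) if utf8_skipper(c) != reference_skip(c)]
print(mismatches)  # prints []
```

Note that Python integers are unbounded, so for ASCII and continuation bytes the shift count goes past 63 and the shift simply yields 0. A C version on a 64-bit type could not rely on that, since shifting by the width of the type or more is undefined there.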

Can you do it in fewer than nine operations? Or with 32-bit boolean arithmetic only? [With no branching or memory access, of course.]

This may just be a mental exercise, but please email me if you can, as I'm starting to feel an affection for the problem!
