Met with Rasmus, JByers, and a few others in San Francisco (at LinuxCare) to discuss internationalization in PHP. The PHPi home is on SourceForge at http://php-i18n.sourceforge.net/, but it's just getting off the ground. A few efforts out there have already started on internationalization at different levels. Hiro (I can't remember his last name =( ), while working for a Far East web portal, added JIS and other Japanese support to PHP so that it can accept form POST vars. His work seems like a good starting point for seeing what problems he ran into.
On the other hand, an IBM project called ICU exists as an Apache/PHP module. It seems quite messy: it's written in C++ and prone to bringing down the Apache thread if not handled with care. Carl, the contact at IBM, said that it is under a sort of BSD license, so hopefully we can fix up whatever is wrong with it and see what it affords us. They seem to have much of the VERY specific work done, including sorting charts, multi-character glyph grouping, etc. The sorting is done with a collate function that normalizes the input string to separate out diacritical marks (accents) and group characters, then runs it through various levels of sorting (exact, whitespace insensitive, case insensitive, etc.). It looks very useful, but also like more than we would need.
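To make the collate idea a bit more concrete, here is a minimal sketch in Python (not PHP, and not ICU's actual code). It assumes that "normalizing" means Unicode canonical decomposition (NFD), and that the sorting levels are cumulative, so the case-insensitive level also ignores whitespace. The function names `split_marks` and `collation_key` are made up for illustration.

```python
import unicodedata


def split_marks(text):
    """Decompose text (NFD) and separate base characters from accents."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    marks = "".join(ch for ch in decomposed if unicodedata.combining(ch))
    return base, marks


def collation_key(text, level="exact"):
    """Build a sort key; level is 'exact', 'whitespace', or 'case'.

    Assumption: levels are cumulative, so 'case' also ignores whitespace.
    """
    base, marks = split_marks(text)
    if level in ("whitespace", "case"):
        base = "".join(base.split())  # drop all whitespace
    if level == "case":
        base = base.casefold()        # ignore upper/lower case
    # Base letters compare first; accents only break ties.
    return (base, marks)


def sort_words(words, level="exact"):
    """Sort words using the sketch key above."""
    return sorted(words, key=lambda word: collation_key(word, level))
```

Calling `sort_words(words, "case")` groups words that differ only by case or spacing, and uses the accents purely as a tie-breaker. Real collation (ICU's included) is locale-aware and far more involved than this.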
The final debate was on how to handle the differences between UTF-8 and UCS-2, and how to tell either of them apart from high ASCII. There seems to be no good way at all (is a form being submitted in multibyte Japanese, or is it a JPEG?). When we do a strlen() on it, do we get the number of bytes or the number of characters? Hopefully someone has some magic solution to this one.
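The bytes-versus-characters question is easy to see in a small Python sketch (again, not PHP). It uses UTF-16 big-endian as a stand-in for UCS-2, which is the same thing for characters in the Basic Multilingual Plane. The helpers `lengths` and `could_be_utf8` are hypothetical names, and the second one is only a validity check, not a magic solution: passing it means the bytes could be UTF-8, not that they are.

```python
# Three characters, Japanese for "Japanese language".
SAMPLE = "\u65e5\u672c\u8a9e"


def lengths(text):
    """Count the same string as characters and as encoded bytes."""
    return {
        "characters": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # UTF-16-BE matches UCS-2 for characters in the BMP.
        "ucs2_bytes": len(text.encode("utf-16-be")),
    }


def could_be_utf8(data):
    """Return True if the raw bytes are well-formed UTF-8.

    This cannot prove the sender meant UTF-8; it only rules out
    inputs that are definitely something else.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

For `SAMPLE`, `lengths` reports 3 characters, 9 bytes in UTF-8, and 6 bytes in UCS-2, so a byte-counting strlen() gives a different answer for each encoding. The first bytes of a JPEG (`b"\xff\xd8\xff\xe0"`) and a high-ASCII string such as `b"caf\xe9"` both fail `could_be_utf8`, while plain ASCII passes, which is about as far as guessing from the bytes alone can go.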