Ruby Regexps and Unicode
In Ruby 1.8 strings have no associated encoding; from Ruby's point of view they are just a sequence of bytes. Regexps are agnostic in that sense as well: they match bytes against bytes, unless you pass one of the flags /u for UTF-8, /s for SJIS, or /e for EUC-JP. By the way, note that /s in Ruby has a different meaning than in Perl, and it is not the only flag that conflicts.
If you set $KCODE to "u", the source code itself is assumed to be UTF-8 and Ruby turns the /u flag on. Ruby on Rails, for example, has done that since version 1.2.
AFAICT it is not clearly defined what support Ruby 1.8 provides for Unicode in regexps. Flanagan & Matz, for example, have little about it except for some vague descriptions. You could say it is just not supported, but some things do work. For example, it is a known trick that counting the matches of /./ gives you the length of a UTF-8 string in characters, whereas #length returns the number of bytes.
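The counting trick can be sketched roughly as follows. This is only an illustration: the helper names utf8_char_count and byte_count are made up for the example, and the sample string is written with \x escapes so that it does not depend on the encoding of the source file.

```ruby
# Hypothetical helpers, named here only for illustration.

# Counts characters by counting the matches of /./ with the /u flag.
# The /m flag is added so that newlines are counted as well.
def utf8_char_count(str)
  str.scan(/./mu).size
end

# Counts raw bytes, which is what String#length reports in Ruby 1.8.
def byte_count(str)
  str.unpack("C*").size
end

name = "Barcelon\xC3\xA8s"   # "Barcelonès", with "è" spelled as its two UTF-8 bytes
puts utf8_char_count(name)   # → 10
puts byte_count(name)        # → 11
```

The two numbers differ by one because "è" is a single character encoded in two bytes.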
Two important bits whose support is definitely only partial are the character classes \w and \s (and thus their negations \W and \S).
In general, the definition of a word char depends on the locale. In Catalan "ò" is a word char. Regexp engines are locale-aware and the meaning of \w depends on the locale. That is, \w is equivalent to [a-zA-Z0-9_] only in ASCII-like locales. In Ruby, if the source code is UTF-8 and /u is enabled, "ò" matches \w.
That matters, of course: a Rails application that validates domain or account names against \w, for example, is permitting accented letters. If they should not be allowed, you need to write the character class explicitly: [a-zA-Z0-9_].
On the other hand, since "ò" and friends match \w, you could be tempted to validate Unicode against \w; I certainly have been more than tempted :-). Wrong! There are characters that match but shouldn't, for example "¿", "¡", or "·".
Whitespace support is poor too. NEL (U+0085) should belong to \s, but it doesn't in Ruby 1.8. A string that consists of NELs is not only not blank in Rails, it also matches \w in Ruby 1.8! Two gotchas for the price of one!
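A minimal sketch of such an explicit check follows. The constant ACCOUNT_NAME and the method valid_account_name? are hypothetical names chosen for the example, and the \A and \z anchors are my own addition so that the whole string has to match.

```ruby
# Hypothetical validator: accepts ASCII letters, digits and underscore only.
# \A and \z anchor the match to the whole string, so a trailing newline
# or any accented letter makes the check fail.
ACCOUNT_NAME = /\A[a-zA-Z0-9_]+\z/

def valid_account_name?(name)
  !!(name =~ ACCOUNT_NAME)
end

puts valid_account_name?("joan_87")        # → true
puts valid_account_name?("j\xC3\xB2an")    # → false ("jòan" as UTF-8 bytes)
```

Because the class is spelled out, the result does not depend on $KCODE or on the /u flag.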
If you need proper Unicode support, among other goodies, you can switch to Oniguruma. That's the regexp engine used in Ruby 1.9, and it is available for 1.8 as a gem:
sudo gem install oniguruma
The gem needs a C library, which is available as a tarball and is also packaged for Ubuntu (at least):
sudo apt-get install libonig-dev
The API is here.
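To give an idea of what a Unicode-aware engine buys you, here is a rough sketch that assumes Ruby 1.9 or later, where the engine is built in; it does not use the 1.8 gem's API. The names UNICODE_WORD, UNICODE_BLANK, unicode_word? and unicode_blank? are invented for the example, and using the \p{L} and \p{N} properties plus the [[:space:]] bracket is just one possible way to express "word" and "blank".

```ruby
# encoding: utf-8
# Sketch for Ruby 1.9+ only. All names below are hypothetical.

# Letters and numbers in any script, plus underscore.
UNICODE_WORD = /\A[\p{L}\p{N}_]+\z/

# Empty, or made only of Unicode whitespace.
UNICODE_BLANK = /\A[[:space:]]*\z/

def unicode_word?(str)
  !!(str =~ UNICODE_WORD)
end

def unicode_blank?(str)
  !!(str =~ UNICODE_BLANK)
end

puts unicode_word?("cami\u00F3")      # "camió": letters only
puts unicode_word?("\u00BF")          # "¿": punctuation, not a letter
puts unicode_blank?("\u0085\u0085")   # two NELs
```

With these patterns the accented word should be accepted, the inverted question mark rejected, and the NEL-only string treated as blank.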
