Better Collation Rule Markup: a critique of Locale Data Markup Language

Posted 25 Jan 2006 at 09:20 UTC by GrahamAsher

Unicode, Inc., administers and maintains the Common Locale Data Repository (CLDR), in which are stored XML files containing locale information, including collation tailorings, for many different locales. The format of the XML files is called Locale Data Markup Language (LDML).

In this paper I criticise some features of those parts of LDML that deal with collation rules, and suggest improvements. A concrete syntax for a new improved collations section may be described in a further article if I have time.

This work was prompted by my experiences in writing an automatic tool to parse LDML files and write data tables to be used by C++ collation functions.

Visual ambiguity

LDML represents characters as actual Unicode text, stored in whatever encoding the XML file uses, usually UTF-8. It is thus impossible for human beings to read the file in a way that is guaranteed to be correct, without the aid of software tools. For example, here is a section from tr.xml, the tailoring file for Turkish:

<reset before="primary">i</reset>

<p>ı</p>

<t>I</t>

<reset>i</reset>

<t>İ</t>

Because it is Turkish, and because people in this field tend to know that Turkish does special things with dotted and dotless Is, we read the I in <t>I</t> as U+0049 LATIN CAPITAL LETTER I. But in fact there is nothing here apart from our wider contextual knowledge that tells us this. The visual appearance is ambiguous. Some other candidates are

U+0399 GREEK CAPITAL LETTER IOTA

U+0406 CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I

U+04C0 CYRILLIC LETTER PALOCHKA

U+FF29 FULLWIDTH LATIN CAPITAL LETTER I

and there are still others.
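To make the ambiguity concrete, here is a minimal Python sketch that prints the code point and Unicode name of each look-alike listed above. It uses the standard unicodedata module; the helper name describe and the LOOKALIKES list are my own illustrative choices, not part of LDML or any CLDR tool.

```python
import unicodedata

# Code points taken from the list above; each can render as a plain capital "I".
LOOKALIKES = [0x0049, 0x0399, 0x0406, 0x04C0, 0xFF29]


def describe(code_point: int) -> str:
    """Return an ASCII-only description: hexadecimal code point plus Unicode name."""
    return f"U+{code_point:04X} {unicodedata.name(chr(code_point))}"


if __name__ == "__main__":
    for cp in LOOKALIKES:
        print(describe(cp))
```

A reader looking at the rendered glyphs cannot tell these five characters apart with certainty; the printed descriptions can.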

Of course it is relatively easy to write an XML parser that reads the file and writes it out as a readable format, but that is not enough - the file itself should be directly readable. I shall now state

Design dogma 1. The file format must enforce unambiguous readability by human beings, without any need for special knowledge.

By `enforce', I mean that it must not be legal to write the data in an ambiguous way. The unambiguous way must be the only legal way.

Human readability and editability are very important at this stage, at which human beings are generally responsible for data entry and checking. Automated tools step in from this point onward, but their output becomes much harder to debug if the initial data files are enigmatic and obscure.

Visual obscurity

Many characters that are not ambiguous are nevertheless extremely hard to identify. Again, the file is not easily readable or editable by human beings. Here is an example from ja.xml, the tailoring file for Japanese:

<pc>亜唖娃阿哀愛挨姶逢葵茜穐悪握渥旭葦芦鯵梓圧斡扱宛姐虻飴絢綾鮎或粟袷安庵按暗案闇鞍杏以伊位依偉囲夷委威尉惟意慰易椅為畏異移維緯胃萎衣謂違遺医井亥域育郁磯一壱溢逸稲茨芋鰯允印咽員因姻引飲淫胤蔭院陰隠 ... </pc>

Again, this file cannot be understood, or checked for validity, without parsing it by computer. Japanese speakers are the only exception, and even they will find it difficult; nobody can hope to compare Unicode values by eye. While initial data is always prepared by - or in consultation with - native speakers, most programming work on collation tailoring is necessarily done by people who don't understand most of the languages they are dealing with.

Difficulty of editing

The two file snippets shown above are difficult to edit, even with the full apparatus of multilingual text editing software with input method editors. There is no advantage in the superficially visually attractive method used in LDML at present, of including the characters to be ordered as actual raw Unicode text.

The software engineer who is working with these files will usually be using an IDE with a built-in editor with limited multilingual capabilities, and even if those capabilities are present, he or she may not be familiar with them. This leads inescapably to

Design dogma 2. The file format must use ASCII characters only.

Taken together, the first two design dogmas suggest using hexadecimal notation for Unicode values, perhaps with a minimum of four digits to better express the fact that they are code points and not other types of values. Thus the I mentioned above would be 0049.
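As a rough illustration of the notation suggested above, the following Python sketch converts text to space-separated hexadecimal code points with a minimum of four digits, and back again. The function names and the choice of a space as separator are assumptions made for this example only; they are not a proposed file syntax.

```python
def to_hex_notation(text: str) -> str:
    """Write each character as a hexadecimal code point of at least four digits."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


def from_hex_notation(notation: str) -> str:
    """Inverse of to_hex_notation: rebuild the text from the hexadecimal tokens."""
    return "".join(chr(int(token, 16)) for token in notation.split())


if __name__ == "__main__":
    # Capital I, lowercase dotless i, capital dotted I
    print(to_hex_notation("I\u0131\u0130"))
```

The output is pure ASCII, so it can be read and edited in any editor, and the conversion loses no information.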

Semantic problem 1: no recognition of character identity

The Arabic collation tailoring rules, as given in ar.xml, contain a single rule:

<reset>ة</reset> <i>ت</i>

It tells us that U+0629 ARABIC LETTER TEH MARBUTA must be sorted as identical to U+062A ARABIC LETTER TEH.

There are, however, two other variants of TEH MARBUTA:

FE94 ARABIC LETTER TEH MARBUTA FINAL FORM

FE93 ARABIC LETTER TEH MARBUTA ISOLATED FORM

The Arabic rules are in error here: they should contain two extra rules making the other two teh marbuta variants identical to the corresponding teh variants. These rules are easy enough to add in this case, because there are only two variants - but what if there were a hundred?

For instance, the standard keys given in allkeys.txt (the Unicode, Inc., file containing non-tailored keys for all Unicode characters apart from those for which keys are generated algorithmically) specify over 40 different variants of F, all with the same character identity. The fact that they are all the same letter is implied by their sharing a primary weight.

Therefore we need to be able to talk about the ordering of the basic letters - the Platonic ideal characters, not their varying exponents as Unicode code points. The rule in Arabic is evidently not ``U+0629 = U+062A'', but ``teh marbuta = teh''.

Design dogma 3. It must be possible to express rules using basic character identity.
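One way to get a feel for what ``basic character identity'' might mean in software is to fold presentation variants onto their base character. The Python sketch below does this with Unicode compatibility normalization (NFKC) from the standard unicodedata module. This is a stand-in I have chosen for illustration: it is not how allkeys.txt groups characters, and it does not cover every variant that shares a primary weight (case variants, for instance, are left alone). The function name base_identity is hypothetical.

```python
import unicodedata


def base_identity(ch: str) -> str:
    """Approximate a character's basic identity by compatibility normalization (NFKC).

    Assumption: this is only an approximation of the idea in the text.
    """
    return unicodedata.normalize("NFKC", ch)


if __name__ == "__main__":
    # The two presentation forms of teh marbuta mentioned above
    for variant in ("\uFE93", "\uFE94"):
        print(f"{ord(variant):04X} -> {ord(base_identity(variant)):04X}")
```

With a mapping of this kind, a single rule written as ``teh marbuta = teh'' could be expanded mechanically to every variant, however many there are.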

Semantic problem 2: over-specification of non-primary weights

To tailor Turkish we need (among other things) to establish dotted and dotless I as separate letters, with dotless I coming before dotted I. The present LDML format contains the rules already quoted above, with my comments added as XML comments:

<reset before="primary">i</reset> <!-- insert a letter before i -->

<p>ı</p> <!-- the letter to insert is lowercase dotless i -->

<t>I</t> <!-- uppercase dotless I is the same letter but differs at tertiary level -->

<reset>i</reset> <!-- start again just after dotted i -->

<t>İ</t> <!-- uppercase dotted I goes here and differs at tertiary level -->

When we come to insert dotted İ after i, how do we decide the tertiary weight? The rules imply that the İ is inserted immediately after i, with a tertiary difference. Now the key specified by allkeys.txt for i is:

[.103C.0020.0002.0069]

which means that the primary weight is 103C, the secondary weight is 20, the tertiary weight is 2, and the quaternary weight is 0069 (all numbers are hexadecimal). The obvious thing is to increment the tertiary weight to 3. But, by a strict interpretation of the LDML syntax we can't do that, because the key [.103C.0020.0003.0069] is already taken, by FF49 FULLWIDTH LATIN SMALL LETTER I.
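The collision can be shown in a few lines of Python. This sketch parses the two keys quoted above and checks whether the ``obvious'' incremented key is free. The Key type, parse_key, and the tiny TAKEN table are hypothetical helpers for this example; a real tool would load the whole of allkeys.txt.

```python
from typing import NamedTuple


class Key(NamedTuple):
    primary: int
    secondary: int
    tertiary: int
    quaternary: int


def parse_key(text: str) -> Key:
    """Parse a key written like '[.103C.0020.0002.0069]' (all fields hexadecimal)."""
    # Splitting on '.' leaves an empty string before the first dot; drop it.
    fields = text.strip("[]").split(".")[1:]
    return Key(*(int(field, 16) for field in fields))


# Only the two keys quoted in the text, mapped to the code points that own them.
TAKEN = {
    parse_key("[.103C.0020.0002.0069]"): 0x0069,  # i
    parse_key("[.103C.0020.0003.0069]"): 0xFF49,  # fullwidth i
}

I_KEY = parse_key("[.103C.0020.0002.0069]")
# The obvious candidate: same key with the tertiary weight incremented.
CANDIDATE = I_KEY._replace(tertiary=I_KEY.tertiary + 1)

if __name__ == "__main__":
    owner = TAKEN.get(CANDIDATE)
    if owner is not None:
        print(f"candidate key already taken by {owner:04X}")
```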

Therefore we must resort either to a comprehensive reordering of the many I variants, or to appending a disambiguating trailing key.

But all this is absurd - it was never the intention of the designers of Turkish collation rules to specify any sort of ordering between dotted I and fullwidth I. The tertiary ordering is over-specified. As with the Arabic example, what is needed is a rule that captures the new letter ordering: something like ``dotless I < dotted I''.

The non-primary weights for the I variants can then be retained. Only the primary weights, representing basic character identity, need be changed.

Semantic problem 3: loss of non-primary weight meaning

Primary weights below the value 8000 all have the same meaning. Each primary weight identifies an idealised character identity.

Secondary and tertiary weights, on the other hand, have important meanings that can and should be used by collation software. For example, ordinary lower-case or non-cased letters have the tertiary weight 2, while uppercase letters have the tertiary weight 8. (See http://www.unicode.org/reports/tr10/#Tertiary_Weight_Table for details.)

These meanings are effectively destroyed when an LDML file is parsed. Clever software can sometimes restore the appropriate values, for example noticing that an uppercase letter has been reordered at tertiary level, and assigning it the tertiary key 8, after making sure that no violence is done to the meaning of the collation tailoring rules - but generally this is impossible without making assumptions along the lines of ``the rules say that, but this is what they really mean''.

Design dogma 4. Secondary and tertiary weights must be explicitly and symbolically stated where necessary.

This dogma needs a little explanation. By ``symbolically'' I mean that the key must be specified using a name that expresses its meaning. For example, the Spanish collation rules introduce a new basic character identity for ñ (n-tilde). Rather than saying that ñ comes after N at primary level and Ñ then follows at tertiary level, we really want to say something like: ``new primary n-tilde = 00F1'' (declare a new basic character identity called n-tilde) then ``n < n-tilde'' (the new character comes after n) and finally state that ``00D1 = capital n-tilde''. (Please note that these rules are not given as examples of suggested rule syntax, but simply as a way of showing the sort of information that should be expressed and some words that might suffice to do so.)

The advantage of this method is that it is now possible to use the same collation key table for variant tailorings that treat secondary and tertiary weights differently. The most important of these is the ability to swap the ordering of upper and lower case letters, which can only be done if the tertiary weights have their usual meanings.
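A minimal Python sketch of the case-swapping idea follows. It assumes that tertiary weights keep the meanings quoted earlier (2 for lowercase, 8 for uppercase) and uses three-level keys only. The key for uppercase I is my assumption, built from the primary and secondary weights quoted for i plus the uppercase tertiary weight; the names sort_key and KEYS are hypothetical.

```python
LOWER, UPPER = 0x2, 0x8  # tertiary weights for lowercase and uppercase, as quoted in the text

# Three-level keys (primary, secondary, tertiary).
KEYS = {
    "i": (0x103C, 0x20, LOWER),  # from the key quoted in the text
    "I": (0x103C, 0x20, UPPER),  # assumption: same letter, uppercase tertiary weight
}


def sort_key(key, upper_first=False):
    """Return a comparable key; optionally swap the two case weights at tertiary level."""
    primary, secondary, tertiary = key
    if upper_first:
        # Possible only because the tertiary weights still carry their usual meaning.
        tertiary = {LOWER: UPPER, UPPER: LOWER}.get(tertiary, tertiary)
    return (primary, secondary, tertiary)


if __name__ == "__main__":
    print(sorted(KEYS, key=lambda ch: sort_key(KEYS[ch])))
    print(sorted(KEYS, key=lambda ch: sort_key(KEYS[ch], upper_first=True)))
```

The same key table serves both orderings. If a tailoring had replaced 8 with some arbitrary value chosen only to avoid a collision, the swap would have nothing to work with.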
