21 Feb 2009 roozbeh   » (Master)

Fonts and Languages: I was repackaging my fonts for Fedora 11 when something caught my attention. The font packaging policy asked for the list of languages my font package supported. But it was a font with a wide range of Latin and Cyrillic glyphs, and it probably supported dozens of languages. Around the same time, I found out that Fedora 11 is considering support for automatic font installation. Among other things, this means that we need to know which fonts support which languages.

Font files don’t have that information directly. How would a font designer know that his font supports Aruban Papiamento just fine, which uses a different orthography than Papiamento as written in the Netherlands Antilles, for example? What about African or Native American languages? Or Mongolian? Or Kurdish? He just designs and tests glyphs for the characters and languages he is interested in. If the resulting font happens to support Filipino too, good for him and his users; if it doesn’t, he may not care. At best, a list of the languages the font designer believes the font supports may be found somewhere in the documentation.

In the present freedesktop stack, the language support detection task is done by fontconfig. When an application, like Firefox, wants to display text in some language, a text layout engine, like Pango, will ask fontconfig for a font that supports displaying text in that language (possibly with some other properties, like the font being bold and sans serif). fontconfig then uses its various font suggestion rules and orthography files to give the best font it can find back to the engine. If fontconfig doesn't know anything about the language, or has wrong information, it may give you something totally off, like a Latin or Devanagari font for a language written in the Arabic script.
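To get a feel for that kind of request, here is a rough Python sketch that builds a fontconfig pattern string and hands it to the fc-match command-line tool that ships with fontconfig. The helper names build_pattern and best_font are made up for this illustration, and "fa" (Persian) is only an example language code. Pango talks to fontconfig through its C API rather than through fc-match, so treat this as a way to poke at the same matching from the outside, not as what the layout engine actually runs.

```python
import shutil
import subprocess


def build_pattern(family, lang, bold=False):
    """Build a fontconfig pattern such as 'sans:lang=fa:weight=bold'.

    Hypothetical helper for illustration; it does no escaping, so it only
    suits simple family names without '-', ',' or ':' in them.
    """
    parts = [family, f"lang={lang}"]
    if bold:
        parts.append("weight=bold")
    return ":".join(parts)


def best_font(pattern):
    """Ask fc-match for the best match, or return None if it is unavailable."""
    if shutil.which("fc-match") is None:
        return None  # fontconfig's command-line tools are not installed
    result = subprocess.run(
        ["fc-match", pattern], capture_output=True, text=True, check=False
    )
    return result.stdout.strip() or None


if __name__ == "__main__":
    pattern = build_pattern("sans", "fa", bold=True)
    print(pattern, "->", best_font(pattern))
```

What comes back depends entirely on the fonts installed and on what fontconfig believes about the language, which is exactly why wrong or missing language data leads to odd choices.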

What font designers may not know (or care about), fontconfig needs to know. The usual way of knowing, especially for not-very-famous fonts or languages, is through orthography files. These files contain a list of Unicode characters that play a letter-like role in the language. For example, for French, it is a list of basic Latin letters plus all the ligatures (like œ) and accented letters (like ï). fontconfig runs the list through each font installed on your machine and sees if it has glyphs for all the characters listed. If it does, the font is assumed to support the language.
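The check described above boils down to a subset test, which can be sketched in a few lines of Python. This is only an illustration under stated assumptions: fontconfig itself is written in C and works on its own orthography files and character sets, the function name supports_language is invented here, the French character list is an abbreviated sample rather than fontconfig's actual data, and the two "fonts" are just sets of characters standing in for real glyph coverage.

```python
def supports_language(font_chars, orthography):
    """Return (supported, missing).

    supported is True only when every character in the orthography has a
    glyph in the font; missing lists the characters that do not.
    """
    missing = sorted(set(orthography) - set(font_chars))
    return (not missing, missing)


# Abbreviated, hypothetical sample of a French orthography:
# basic Latin letters plus some accented letters and the ligature œ.
FRENCH_SAMPLE = set("abcdefghijklmnopqrstuvwxyz") | set("àâçéèêëîïôœùûüÿ")

# Hypothetical font coverage: printable ASCII plus the Latin-1 supplement.
latin1_font = {chr(c) for c in range(0x20, 0x7F)} | {
    chr(c) for c in range(0xA0, 0x100)
}

# The same coverage with a glyph for œ (U+0153) added.
extended_font = latin1_font | {"œ"}

print(supports_language(latin1_font, FRENCH_SAMPLE))    # (False, ['œ'])
print(supports_language(extended_font, FRENCH_SAMPLE))  # (True, [])
```

The first font fails only because œ lies outside Latin-1, which shows how a single missing letter, or a single wrong entry in an orthography file, can flip the answer for a whole language.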

Getting back to my own story, I thought of checking the orthography files to see which languages my packaged fonts support. But when I looked into a few, I found several bugs and unsupported languages. Behdad encouraged me to fix them early, so that the fixes would have a chance of getting into fontconfig 2.7.

During the past few weeks, I’ve been trying to hunt things down and fix them in my free time. I achieved my first target of matching glibc locales (those without ‘@’). I’m now on my second target of matching languages with two-letter codes; remaining are: Akan, Avestan, Cree, Ewe, Herero, Sichuan Yi, Javanese, Kanuri, Kongo, Kuanyama, Luba-Katanga, Nauru, Navajo, North Ndebele, Ndonga, Ojibwa, Pali, Quechua, Rundi, Sango, Shona, Sundanese, Tahitian, and Zhuang. After that, there are thousands of languages with three-letter codes, which would need an army the size of SIL International.

Everything I did is in my git tree here. If you want to help, file bugs with your findings at http://bugs.freedesktop.org/. You can also check out the existing orthography bugs to avoid duplication.
