12 Oct 2010 roozbeh   » (Master)

Unicode 6.0 was released today. Here is the link to the announcement: http://www.unicode.org/press/pr-6.0.html

The following changes should be interesting to the Persian and Iranianist computing community (based on an original post to the Persian Computing list):

  • Sixteen symbols have been encoded in the Arabic Presentation Forms-A block for use in pedagogical materials and documents discussing the features of the Arabic script.

    Please note that these are not combining characters but stand-alone symbols. These should only be used to display the dots and diacritics in isolation, and not for making new letters. For example, one can *not* use a Seen and add U+FBB6 Arabic Symbol Three dots Above to get a Sheen. If you type that, you will get a Seen followed by three dots. According to the standard, "These are spacing symbols representing Arabic letter diacritics considered in isolation, as for example as in discussions about the Arabic script."

    Updated Unicode chart:

  • The Qur'anic character U+06DE ARABIC START OF RUB EL HIZB has had its glyph and properties changed.

    For some unknown historical reason, the character was mistakenly classified as a combining character instead of just a symbol, which made it unusable. The character is now a normal spacing symbol and is usable as originally intended.

    Background document for the change (which I authored):

  • Two characters have been encoded in the Arabic script block for use in Kashmiri, one of the official languages of Jammu and Kashmir, the Indian-administered part of Kashmir. The language is written in both the Arabic and Devanagari scripts, with the choice largely following religious lines (Muslims and Hindus, respectively).

    The two new characters are U+0620 Arabic Letter Kashmiri Yeh and U+065F Arabic Wavy Hamza Below. Also, U+0673 Arabic Letter Alef With Wavy Hamza Below has been deprecated (the first Arabic script character to ever get deprecated in Unicode), and the character sequence <U+0627, U+065F> should be used instead of it.

    Unicode proposal (I'm a coauthor):

    Updated Unicode chart:

  • Mandaic has been encoded. Mandaic is the script used by the Mandaeans (mostly living in southern Iraq and southwestern Iran, especially Khouzestan) for liturgical purposes. This is the community that some people believe the Qur'an refers to as Sabians, the third member group of the People of the Book (next to Jews and Christians).

    Michael Everson's proposal:

    Unicode chart:

  • Brahmi is also encoded, which is of use to Iranianists (some Iranian languages like Khotanese have been written in Brahmi).

    The most detailed proposal (although not the final one that got encoded):

    Final Unicode chart:

  • Unicode Standard Annex #9, The Unicode Bidirectional Algorithm, has been updated to include more information and some clarifications. Note that the algorithm has not changed. The update just explains the original intentions in more detail. For the list of informational changes to the text, see the following link (Behdad Esfahbod and I have contributed to this and previous versions of the standard annex):

  • A new data file has been added to the Unicode character database, listing some characters that are used with several scripts (and which scripts those are). For example, from the data file one can learn that the Arabic Tatweel and some of the Arabic harakat are also used with the Syriac script, the Arabic-Indic digits are also used with Thaana, and the Arabic comma, semicolon, and question mark are also used with both Syriac and Thaana:

  • More than a thousand new symbols have been added, including lots of symbols that you can find on electronics, maps, menus, signs, etc. Most of these were added to support Emoji, symbols mostly used on Japanese mobile phones for text messages, emails, chat, and even cellphone novels:

    For you chart browsers out there, here are some of the blocks that contain the new symbols (color-coded yellow):
    http://www.unicode.org/charts/PDF/Unicode-6.0/U60-1F0A0.pdf (playing cards)
    http://www.unicode.org/charts/PDF/Unicode-6.0/U60-1F300.pdf (lots of interesting new symbols, including symbols for beverage containers)
    http://www.unicode.org/charts/PDF/Unicode-6.0/U60-1F600.pdf (emoticons, also known as smileys)
    http://www.unicode.org/charts/PDF/Unicode-6.0/U60-1F680.pdf (transport and map symbols)

    Please note that Unicode encodes beverage containers, but not alcoholic beverages (I personally made sure of that, to reduce possible objections). For example, there is no BEER encoded, but only BEER MUG (which is also used for non-alcoholic beer, among other uses).

    Religiously devout people who may object to some game characters or musical instruments getting encoded should note that Unicode implementations are not required to support any specific character, and are allowed to choose their own set of characters to support. The game symbols are encoded only for the sake of Unicode implementations (especially those in East Asia) that need them to support their users.

  • And finally, the official detail of additions and changes to the standard, for the hardcore:
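
As a rough way to check the character-level notes above (the new Arabic symbols are not combining, U+06DE is now a plain symbol, and U+0673 should give way to <U+0627, U+065F>), here is a minimal Python sketch. It assumes a Python whose unicodedata module is based on Unicode 6.0 or later. The helper names is_combining and replace_deprecated_alef are made up for this example and are not part of any library.

```python
import unicodedata

SEEN = "\u0633"              # ARABIC LETTER SEEN
SHEEN = "\u0634"             # ARABIC LETTER SHEEN
THREE_DOTS_ABOVE = "\uFBB6"  # ARABIC SYMBOL THREE DOTS ABOVE (new in 6.0)
RUB_EL_HIZB = "\u06DE"       # ARABIC START OF RUB EL HIZB
WAVY_HAMZA_BELOW = "\u065F"  # ARABIC WAVY HAMZA BELOW (new in 6.0)
DEPRECATED_ALEF = "\u0673"   # ARABIC LETTER ALEF WITH WAVY HAMZA BELOW
REPLACEMENT = "\u0627" + WAVY_HAMZA_BELOW  # <U+0627, U+065F>


def is_combining(ch):
    """Hypothetical helper: True if the general category is a mark (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def replace_deprecated_alef(text):
    """Hypothetical helper: plain string replacement of U+0673 by <U+0627, U+065F>."""
    return text.replace(DEPRECATED_ALEF, REPLACEMENT)


# Seen followed by the stand-alone symbol stays two characters, even after NFC.
typed = unicodedata.normalize("NFC", SEEN + THREE_DOTS_ABOVE)

print(is_combining(THREE_DOTS_ABOVE))  # the symbol is spacing, not a mark
print(typed == SHEEN)                  # it does not turn Seen into Sheen
print(is_combining(RUB_EL_HIZB))       # no longer classified as a mark
print(is_combining(WAVY_HAMZA_BELOW))  # this one really is a combining mark
```

The replacement helper is a simple text substitution; whether existing data should be rewritten that way is a decision for each application.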

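The new data file on characters shared between scripts can be read with a few lines of code. The sketch below assumes the usual Unicode Character Database layout (a code point or range, a semicolon, then space-separated script codes, with # starting a comment). The sample lines are made up from the examples in this post rather than copied from the real file, and parse_script_extensions is a hypothetical helper name.

```python
# Illustrative sample only: built from the examples mentioned in this post,
# NOT copied from the actual data file.
SAMPLE = """\
# code point(s) ; scripts        # comment
0640          ; Arab Syrc        # ARABIC TATWEEL
0660..0669    ; Arab Thaa        # ARABIC-INDIC DIGIT ZERO..NINE
060C          ; Arab Syrc Thaa   # ARABIC COMMA
"""


def parse_script_extensions(text):
    """Hypothetical helper: map each code point to the set of scripts it is used with."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code_range, scripts = (field.strip() for field in line.split(";", 1))
        first, _, last = code_range.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        for cp in range(start, end + 1):
            table[cp] = frozenset(scripts.split())
    return table


table = parse_script_extensions(SAMPLE)
print(sorted(table[0x0640]))  # ['Arab', 'Syrc']
```

A dictionary keyed by code point is wasteful for large ranges, but it keeps the lookup obvious for a short example like this one.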