Fonts and CSS

Posted 19 May 2004 at 21:06 UTC by ramoth4

With the advent of CSS, it is possible to create incredible typographic effects. However, only a half-dozen fonts are commonplace enough to be used online. I propose a solution to this problem.

I recommend that the font() function be added to CSS.


font(PATH, TYPE)

The first argument is the path to the font file; it should follow existing CSS conventions about paths. The second argument is a MIME type. User Agents can ignore a font whose MIME type they do not support, saving bandwidth and reducing load time (otherwise the font would need to be downloaded and examined before being rejected). In addition, User Agents would be encouraged to cache fonts. Combining this extension with CSS text effects such as shadowing and transparency could lead to some interesting and creative results.

Example usage:

div.fancy {
  font-family: font("../fonts/", "x-font/truetype");
  color: rgb(128,255,128);
}

I would appreciate some feedback on this idea from Advogatons before I submit it to the W3C lists.

embedded fonts, posted 19 May 2004 at 21:20 UTC by brondsem » (Journeyer)

Internet Explorer supports Embedded Fonts, but Mozilla et al. don't seem to; probably some proprietary restriction. Good luck!

Re: embedded fonts, posted 19 May 2004 at 21:25 UTC by ramoth4 » (Journeyer)

Interesting. Someone else mentioned that, but I figured they were talking about those plugins that existed 8 or 9 years ago.

Also, a correction to my example:

font("../fonts/", "application/x-truetype")
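
Folded back into the earlier example, and assuming the proposed function would participate in the normal font-family fallback list (so a User Agent that skips the download still has something to render), the corrected call might look like this; the truncated path is kept from the original, and the Georgia/serif fallbacks are just illustrative:

```css
div.fancy {
  /* hypothetical font() from this proposal; local fallbacks after it */
  font-family: font("../fonts/", "application/x-truetype"),
               Georgia, serif;
  color: rgb(128, 255, 128);
}
```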

embedded bitmap fonts, posted 19 May 2004 at 22:48 UTC by mslicker » (Journeyer)

Near the end of this article there is a proposal for embedded bitmap fonts. The author seems mainly concerned with copyright issues, however I believe the proposal is a very elegant solution to adding advanced typography to web pages.

From the client side it is very cheap to implement; you don't have to worry about every kind of font technology invented (TrueType, OpenType, Type 1, etc.). This is important if the W3C is truly concerned with interoperable technology.

Bitstream's format is open, posted 20 May 2004 at 14:04 UTC by jaldhar » (Journeyer)

The specs for the PFR format used in Bitstream's TrueDoc font-embedding technology are available. IIRC it uses a combination of CSS and meta tags.

Argh!, posted 20 May 2004 at 14:17 UTC by bmastenbrook » (Master)

No, no, no. Instead of giving sociopathic web developers even more control with which to pick egregiously bad fonts, the solution to the problem is to come up with a language describing the type of font intended, and let the local browser pick a font to satisfy that role. If you want absolute control, of course, there's always PDF, but on a web site, do you really need Apple Chancery rather than Bitstream Vera Serif or Times New Roman? Will it just look SO WRONG that it's worth making the browser download the font you want?

I'd rather see CSS acquire a language for describing the type of font (serif, sans-serif, black/heavy, thin, monospaced, etc.) and let the browser use a few heuristics. Then I could write another CSS file which could even constructively override those fonts, e.g. substituting Bitstream Vera Sans for all of the sans-serif fonts requested, and Bitstream Vera Serif for all of the serif fonts, without destroying the original intention of the page. (CSS can't do this currently.)
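
For reference, CSS already names five generic families (serif, sans-serif, cursive, fantasy, monospace) that authors can use as role fallbacks; the gap bmastenbrook describes is that a user stylesheet can only flatten every role to one face, not substitute per role. A sketch of the limitation, with placeholder faces:

```css
/* Author sheet: exact faces, with generic roles as fallback */
h1   { font-family: "Apple Chancery", cursive; }
p    { font-family: "Times New Roman", serif; }
code { font-family: "Courier New", monospace; }

/* User sheet: there is no selector meaning "wherever the author
   asked for a serif", so this flattens all three roles to one face */
* { font-family: "Bitstream Vera Sans" !important; }
```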

Re: Argh!, posted 20 May 2004 at 14:32 UTC by jaldhar » (Journeyer)

CSS already has a mechanism for describing fonts. The important part is the downloading. Right now if I wanted to build a web site in an Indian language, I'd pretty much have to specify Internet Explorer. Most users are not savvy enough to install fonts and for non-roman scripts, having the wrong font doesn't just make the site ugly but unreadable.

Remember, Fonts Can't be Copyrighted, posted 20 May 2004 at 15:13 UTC by johnnyb » (Journeyer)


FYI fonts actually can't be copyrighted. Font _software_ can be copyrighted, but not the fonts themselves. Some people think that certain font hinting routines can be copyrighted, but I've seen no actual evidence of this. The copyright office a while back said that fonts, even electronic ones, are not copyrightable.

Re: Remember, Fonts Can't be Copyrighted, posted 20 May 2004 at 16:31 UTC by slamb » (Journeyer)

I'll believe that one when I see third parties legally selling the Adobe Type Library for cheap, cheap, cheap.

Re: Remember, Fonts Can't be Copyrighted, posted 20 May 2004 at 18:23 UTC by mslicker » (Journeyer)

johnnyb, it seems this is the case in the U.S., although there may be more to it than that. For instance, is this legally binding?

My interest was not in the protection of the font designs, but in consistency and ease of implementation. It seems web designers will try for WYSIWYG (what you see is what you get) no matter how difficult the technology makes it to achieve. The days of Netscape vs. Microsoft, table layout, and font tags were particularly burdensome. CSS is an improvement. Ideally a web page would be a description of the content; styles with precise layout semantics could optionally be applied.

Further, I like the idea presented in the article that the browser would give a hint as to what medium the page is viewed on, perhaps in the HTTP header. The resulting page could be optimized for the medium, be it an 8x11 200 dpi inkjet, a low-resolution LCD (320x200), etc.

@font-face, posted 20 May 2004 at 18:54 UTC by MisterBad » (Master)

The @font-face directive was present in versions of CSS2 previous to 2.1. It's how dynamic fonts work.

Netscape Navigator 4 supported it, but the code was 3rd-party and was removed when the source was released to become Mozilla.

@font-face is part of the CSS3 draft, but since it was removed in the more recent 2.1, it may not survive.
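
For comparison, the CSS2 directive MisterBad mentions looked roughly like this; the font name and URL below are placeholders, not from any real site:

```css
/* CSS2 @font-face: bind a downloadable file to a family name */
@font-face {
  font-family: "Devanagari Sample";            /* name used below */
  src: url("http://example.org/fonts/sample.pfr"); /* placeholder */
}
p.hindi { font-family: "Devanagari Sample", serif; }
```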

Font copyright, posted 20 May 2004 at 19:03 UTC by AlanShutko » (Journeyer)

In the US, font designs (the way the letters look) can't be copyrighted. Names can be trademarked, and the actual PS or TTF files can be copyrighted. That's because in a weird decision, some US court (I don't recall which) decided that typefaces were purely functional and the designs couldn't be protected. The PS or TTF files, though, do have sufficient room for artistic input to be copyrightable.

The way around this is to print something with the fonts, scan it, and fit curves to the result. slamb, there are companies which do this! Look at the billions of fonts you get, essentially for free, with Corel Draw or WP. Some typeface companies originally built their business on knockoff fonts. They usually have bad or no hinting, and a limited selection of glyphs and variations (small caps, widths, etc.) compared to the original.

Security issues?, posted 21 May 2004 at 03:37 UTC by glyph » (Master)

Are existing font parsers secure enough against input off the 'net to really make this feasible? IIRC the Xbox linux-boot "exploit" was a bug in the standard Windows TrueType font parser.

Aiiee! Comic Sans!, posted 21 May 2004 at 15:10 UTC by abg » (Journeyer)

If this means that Joe Author can make MS Comic Sans display on my machine, I'll have to vote 'No' just out of principle. 8^)

I override everything with Vera Sans anyway.

paranoid, posted 21 May 2004 at 19:30 UTC by gilbou » (Observer)

i am quite paranoid after some extensive use of openbsd. these downloadable fonts seem like a huge, open "hack me please" offer to the kiddies and morons around. i would prefer a set of fonts to be developed and freed for all. oh, we have something like that: bitstream offers the vera family of fonts, and they're available in gnome but for windows too! :D well, i will not let my browser download fonts from a website. first, i like to use my own fonts and set a minimum size, because the stupid monkeys that design web sites like to assume huge screens while most people are between 800x600 and 1024x768. second, i will not trust downloaded stuff i have not carefully checked (i like to check sources - i'm _that_ paranoid), and last, this is fucking my too small bandwidth ;)

I'm tired of people trying to control my browser, posted 22 May 2004 at 06:39 UTC by Omnifarious » (Journeyer)

Stop trying to make your page look pixel perfect and just accept that people's browsers will render things differently. Concentrate on the content of your page and making sure that the structure accurately represents the data and trust browsers to do the right thing.

Re: I'm tired of people trying to control my browser, posted 22 May 2004 at 08:17 UTC by tk » (Observer)

But what about non-Latin scripts, as jaldhar pointed out?

OK, posted 22 May 2004 at 14:50 UTC by Omnifarious » (Journeyer)

Non-Latin scripts are a reason for this extension.

But I agree with some of the stated security concerns. I could see someone hacking font renderers to gain control of your machine. I could also see advertisers using it as a way to sneak ads past filtering software. It opens up a new path for data that has to be bolted down and made secure.

There has to be some better way of handling that problem that isn't phrased as an imperative "Download this!" to your browser.

My pipe dream, posted 22 May 2004 at 15:42 UTC by mslicker » (Journeyer)

The web is content based. If you want to avoid the whole mess of poorly designed web sites, you are free to do so. Avoiding bad content might be harder. Designers can control the exact look of a web site with optionally applied precisely interpreted styles. The browser is not left with any ambiguity as to what the "right thing" to do was. The job of the web designer (someone who designs professionally) and the browser implementer is made considerably easier. None of this should be (or should have been) technically difficult to achieve.

Every objection presented here to embedded fonts is resolved by the choice of an appropriate browser. Many browsers give an option of what fonts to use for all sites. As for security, if you don't trust a browser to handle fonts securely, why should you trust it to handle any other content securely?

As for non-Latin scripts, I'm surprised that the browser has the ability to handle non-Latin scripts, yet not the ability to download the correct non-Latin font when viewing a site with such a script. Maybe font embedding is the only way to work around this case.

Before submitting to the W3C..., posted 23 May 2004 at 04:44 UTC by piman » (Journeyer)

How is your proposal substantially different from their existing Web Fonts standard, which has been around since 1997?

Thank you!, posted 23 May 2004 at 08:46 UTC by ramoth4 » (Journeyer)

piman: It's a bit simpler (no @font-face directive). After I posted the article, I read the W3C proposal on embedded fonts. Then I noticed the date. It looks to me as if this proposal is dead.

mslicker: you're absolutely right. The same thought process went through my head.

Everyone: Thank you for the great comments! I'm glad that there haven't been any flames (people opposed to the idea, yes, but nobody called me retarded. I'm impressed!)

And yes, Non-Latin scripts will benefit greatly from this (although I must be honest, it wasn't in the front of my mind when I was generating this proposal).

Keep the comments coming!

hail the holy cow, posted 23 May 2004 at 22:16 UTC by gilbou » (Observer)

                (oo)   hey! what about fixed size fonts !?
         /-------\/   /
        / |     ||
       *  ||----||
          ^^    ^^

Non Latin, posted 24 May 2004 at 10:17 UTC by Malx » (Journeyer)

And what is the problem?
Just set the correct character-set header or http-equiv meta tag and the browser will do all you need.

This works with IE and Mozilla - both will warn me that the page requires a font which I do not have, and will offer to install/download it automatically.

But if you ever do this, consider downloading every character glyph independently. I do not need a full font (which could be hundreds of megabytes for all languages) just to see the word "Hello" in fancy letters.

Have you considered a server-side font-to-SVG renderer?

Re: Non Latin, posted 24 May 2004 at 17:06 UTC by jaldhar » (Journeyer)

What is the problem?

Web pages in Indian languages either use non-standard character mappings (in which case you are forced to use that particular font) or, more sensibly, Unicode, in which case you need a font that has Indic glyphs, and not in a half-assed way (I'm looking at you, Arial Unicode), so once again you need a particular font.

IE and Mozilla may behave this way on Windows but not on Linux. (Of course IE isn't even available on Linux.) Implementing such functionality is exactly what's being discussed here. In case you don't believe me, go to this page. The page has the charset set to UTF-8 via a meta tag. Do you get prompted to download a Gujarati font? I don't think so.

At least with Bitstream's technology, only the glyph outlines are downloaded, so even a full Chinese font is actually not that big.

SVG may be the answer one day, but right now it is also badly supported in browsers, requires downloads, etc. Plus it is more work for content providers than simply writing out text.

The focus should not be on what you or I need but the needs of typical non-technical people -- farmers, schoolchildren, even luddites with Ph.Ds like my dad -- who just want to use the web in their preferred language without jumping through hoops.

Been there - Done that - Got the T-shirt, posted 26 May 2004 at 16:59 UTC by freetype » (Master)

This topic has been heavily discussed by the W3 "web fonts" discussion group, several years ago. I think what you propose won't work for several reasons:

  • Sending a copy, or even a subset or "conversion", of a font file over the Internet is a legal nightmare unless you have the right to do so; which means that you're either the font designer, or bought a very pricey license (hint: default font licenses do not allow this use, and font designers simply don't want to see Internet-wide distribution of their work).

  • Font technology vendors (including Apple, Microsoft, Bitstream, Adobe, Agfa) don't want a single web font format standard that would reduce the value of their offerings, unless they control it. Good luck making them all agree, and not encumbering everything with patents...

  • Finally, people don't care! Witness the "grandiose" successes of Bitstream TrueDoc (included in nearly all versions of Netscape) and Microsoft's WEFT. I couldn't find a single page in my surfing life that used them, except for their demo sites. I'm sure there are some fanatical graphic designers somewhere who are quite fond of them, but I couldn't find them. Not that I really cared.


Bitstream holds several patents regarding the PFR format. One of them prevents you from creating PFR fonts, which means you're forced to buy their expensive tools to produce them.

Even if the format is published, this doesn't mean that it's open. For example, I can read and render PFR fonts, as long as I do not follow the hinting algorithm described in this other patent. But I can't create them. This is not a big problem, because the format really stinks. Its only appeal is that it's reasonably compact, but so is CFF/Type 2.

Re: Re: Non Latin, posted 26 May 2004 at 23:12 UTC by Malx » (Journeyer)

Ok I see :(
The page you specify is not found, but I found another one. There is "indic.ttf" specified as a meta keyword :(

I got no warning, because it is UTF-8, not some known non-standard charset. Why not just register this charset and give it a name?

How many patches?, posted 10 Jul 2004 at 19:08 UTC by Ankh » (Master)

Lots of people want to carry on using and extending and patching HTML 4.

PFR and WEFT failed for a number of reasons -

  • at the time there wasn't widespread understanding of why putting text in images was a bad idea (searching, accessibility, translation, screen size, ...)
  • each format originally worked only in one browser.
  • the Bitstream approach had serious quality problems, some of which stemmed from how the code was incorporated into Netscape, and some from the way the technology renders a font at high resolution and then does curve-fitting in order to avoid a legal problem; this throws away TrueType delta hints, which add or subtract detail at different resolutions.
  • typeface designs aren't subject to copyright in the US, although they are copyrightable pretty much everywhere else. On the other hand, typeface names are often trademarked and guarded jealously. The implementation of a typeface design, the actual code, can be patented, possibly copyrighted, and in any case is usually sold subject to a license that forbids this sort of use. I'll add that respecting licenses of all kinds is a major part of the GNU philosophy: if you don't like a license, your choice is to come up with a Free replacement, not to break the license
  • slow dialup links and low resolution monitors of ten years ago meant that downloaded fonts were restricted to headings and specialty uses like mathematical symbols
  • there's no fallback mechanism. E.g. if I use a font for mathematical symbols not in Unicode, and the font isn't downloaded, what does the user see? For this reason, images are more popular except in places like large corporate desktops or university library computers, where a journal publisher can supply fonts
  • In most cases it's more effective to tell people to buy the right fonts and install them

I did make a sample page which should work both in IE and Netscape 4, although you may get square boxes in some places in IE, I think, instead of letters. It's not really all that good an example. What I'd like to do is beyond the typographic level of any major current HTML browser.

What the heck is the point of downloading fonts for my Web page when I don't have sufficient control over line spacing, min/max line length, font size, when I can't specify full justification with min and max word spacing and a proper alignment zone, when even setting punctuation properly (hung into the margins, thin spaces where needed) involves major markup, where small caps don't use the small caps font variant and I can't reliably tell them to, where . . . </rant>

In other words, the right question to ask, I think, is do we add one more spice to a soup that's so full of salt you can't taste the meat, or do we start afresh? Or to be a little nicer, since in fact I've been a fan of HTML since the Spring of 1993 when I first saw Mosaic, the question to ask is what level of typographical sophistication can we reasonably expect from today's Web browsers, and what are the likely paths forward for improving the situation.

We need more links between the Web, Open Source and Free/libre, and graphic design communities. This means programmers understanding that graphic design isn't just arbitrary, but involves the human visual perception system and has many parameters that can be measured and that are an important part of communication. (I gave a talk on typography from this angle at GUADEC this year if you are interested.) It involves graphic designers and font designers understanding that not all programmers are evil scientists who don't want to pay for anything and build huge systems like Usenet solely for the purpose of font piracy (I am paraphrasing a font designer here, by the way). In other words, we need to build and foster mutual respect and understanding.

There's only so much we can do in the standards communities. We (I work at W3C) can publish CSS and HTML specs until we're blue in the logo, but we can't force people to go out and implement them. Rendering buggy HTML pages, like Windows running old MS-DOS programs, is a huge financial and technical burden. At one time no-one wanted a PC that didn't run Lotus 1-2-3. Now no-one wants a Web browser that won't display popular Web pages such as those of CNN or Disney.

There's some hope that SVG may provide a way forward, and indeed SVG allows embedded fonts, but not with enough rights management and, in native SVG format, currently without hinting that's often necessary for the screen or for printing at under 20 or 30 pixels per em.

It's possible that a future browser based on SVG and XSL/FO might be a good way forward. Heck, the latest SVG draft even allows multi-way links, so you can click a link and get a pop-up of choices. Forward into the 1960s :-)

In the meantime, let's focus on getting good Unicode script coverage Freely available, and on improving the typographic infrastructure on Linux, working with the graphic design world instead of appearing to be either ignorant of it or fighting it. Some good signs include programs like Passepartout, the fontconfig work, and better text support in the Gimp. We have a long way to go.

w3c, posted 10 Jul 2004 at 22:12 UTC by mslicker » (Journeyer)

Why are fonts considered worthy of protection but text and images not? Why does the w3c consider the rights of font-designers much more important in comparison with the rights of authors and the rights of photographers? The only possible explanation is a bias of the membership of the w3c.

In all honesty, I am not a fan of the work of the w3c. I'm appalled reading some of the specifications, here is one striking example I have noted from HTML:

For cases where neither the HTTP protocol nor the META element provides information about the character encoding of a document, HTML also provides the charset attribute on several elements. By combining these mechanisms, an author can greatly improve the chances that, when the user retrieves a resource, the user agent will recognize the character encoding.

Apparently, something as simple as determining the character encoding of a document is a very arduous and not even certain task.

I would much rather pick up where Knuth left off. Knuth did excellent work in the field of computer typography before the W3C ever set foot in the arena. A new system might actually fulfil the W3C goal of interoperability. An interoperable specification should not be dependent on Linux or any other system or set of libraries. For those not aware, this is the stated goal and first sentence on the W3C web site:

The World Wide Web Consortium (W3C) develops interoperable technologies (specifications, guidelines, software, and tools) to lead the Web to its full potential.

W3C, posted 13 Jul 2004 at 21:02 UTC by Ankh » (Master)

mslicker - I'm not trying to hold up everything we (W3C) do as perfect. It isn't. On the other hand I was involved in IETF efforts to standardise HTML. Those efforts did result in an RFC, but failed in that neither Microsoft nor Netscape were particularly involved.

Yes, the HTTP default character encoding is a problem; that work was done jointly between W3C and IETF, was hampered by backward-compatibility problems, and actually turns out not to be simple. That you can find a problem with a specification in very heavy use by hundreds of millions of people does not make the specification useless. Life isn't perfect, and sometimes imperfect specifications are the ones that are adopted.

Text and images are worthy of protection - I didn't say otherwise. In my comment, I was trying to explain why I felt downloadable fonts had not caught on, and IPR of text and images isn't (I think) a major factor there.

I'm sorry if the fact that I talked about several issues that are only loosely related was confusing. Improving typographical infrastructure on the Linux (or GNU/Linux[tm]) desktop is a separate issue from the future of W3C specifications.


W3C, posted 14 Jul 2004 at 02:12 UTC by mslicker » (Journeyer)

The example I pointed out does not invalidate the work of the W3C, but for me it does arouse suspicion. In the document referenced in a post above (Web Fonts), we are given bulky descriptions of fonts and an elaborate font-matching algorithm whose processing is likely to rival the complexity of processing actual fonts. I don't think it is a coincidence that so few browsers actually comply in large part with the W3C standards. A browser should be a commodity, but today it is sold as a platform.

I would like to create a new specification which is free, clean, efficient to transmit and process, simple to implement, and addresses the needs of the web feature-wise. This combination would help to promote adoption through independent implementation and compatibility. A major difference from the W3C approach is that I would specify an actual programming language. The declarative nature of HTML means that every tag must already be understood by the browser, and therefore previously specified and implemented. I would specify a kernel of typographic and interactive primitives which could be extended. One language could replace several (SGML, HTML, JavaScript, CSS), and would be much simpler and more powerful. Although I would specify a programming language, in practice the program would take the form of markup, in a similar fashion to HTML, LaTeX, etc. The language would be stored and transmitted well-formed and unambiguous, so there would be no wrong markup like that which can be found presently on major and minor sites alike. Existing markup as found on the web could be translated and mapped to this new kernel.

I doubt the W3C would be supportive of this, so it will be an independent effort. It could benefit from others with a keen perspective on computer typography, writing systems of the world, accessibility, and computer language design.

A single language, posted 19 Jul 2004 at 19:37 UTC by Ankh » (Master)

One problem with rewriting the Web in a programming language is that authors would then have to be programmers. Another problem is that right now, XHTML documents, and XML documents in general, can be processed and presented in a number of different ways.

"Wrong markup" comes partly because the Web is simple enough that most people are able to make a Web page. It also comes from an early decision, pre-W3C, that browsers should be "tolerant in what they accept", a misapplication of an IETF mantra. Today's Web bowsers include what amount to expert systems to try to parse HTML and display pages in ways that are bug-for-bug compatible with Netscape 4 and IE 3.

A new specification would start out with no users, though. And it seems a shame to throw away all the research and investment that's been done in the areas of XML and RDF and the Semantic Web, too.

The reason that Web Fonts are not widely used (and they are widely implemented, in the sense that IE 4 and later support them) is not the complexity of the spec. We (W3C) can't force people to implement anything at all. What goes into drafts (and the reference you posted is to a draft) are things that people on the Working Group say they will implement, or that user communities say they need. The final spec is always a compromise between these two communities, the implementors and the users, and the reason that W3C has public review processes is to try to make that compromise as fairly and usefully as possible.

We'd like something cleaner too, although you and I might not agree on what is "clean", of course. But getting non-sucky implementations from major vendors is pretty tricky, and takes a lot of time and energy. It's no use saying, "if I build it and it's popular everyone will implement it", because the truth is, if you build it and it's popular, people will implement it badly and incompatibly until it's standardised, and by that point it'll be too late.

As for interest in typography -- although there are maybe a couple of dozen or more implementations of XSL/FO, I don't know of any that are in a Web browser. I don't see Web browsers trying to push typography beyond the level of IE 5 or so. So it's difficult to see how to move forward.


a single language, posted 21 Jul 2004 at 02:40 UTC by mslicker » (Journeyer)

For any web language the authors must know the semantics (formally or informally), whether that language is fixed or extensible. HTML is a programming language, but of a very limited kind, and it is not uncommon for people trained as graphic designers to learn JavaScript, which is closer to a traditional programming language. That the browser does not throw up its hands at the slightest error probably makes HTML easier for non-programmers. A language could be made which is well defined for any possible input; although no errors would be reported, possible mistakes could be indicated.

An extensible programming language could be processed like XML documents. For any command there is its definition and its actual use in the document. The definitions could be separated and redefined, yielding a different effect. I like this aspect of CSS: the same web page can be presented in a variety of ways with no change to the main content.

As for XML, RDF, and the Semantic Web, although I lack a thorough knowledge, I see these as ancillary. The stated goal of the Semantic Web is "to create a universal medium for the exchange of data". This goal has for the most part been achieved quite elegantly by the binary numbering system. What is the motivation to suddenly implement all protocols in terms of XML? The benefits are not obvious to me.

Is the kind of success Knuth had with his TeX system repeatable? The system "froze" with the publication of TeX82, the only changes since have been bug fixes. Over twenty years have passed and it still finds heavy use in technical and scientific publications providing excellent quality documents with the same TeX82 core. In comparison with HTML it is somewhat more difficult to use, but perhaps that problem can be addressed.

I've glanced at XSL/FO. Looking at the index, it seems they are attempting to enumerate a variety of typographic parameters; I don't know if it is a complete system or if this is even a desirable approach to general typesetting. My study so far has concentrated on The TeXbook, the TeX implementation, and existing web/internet standards. In addition, I would like to make note of this article, which has made me somewhat skeptical of the Unicode effort.

The Good, The Bad, And The Ugly, posted 21 Jul 2004 at 08:47 UTC by tk » (Observer)

Perhaps Unicode is flawed, but much of the Unicode Revisited article you linked to is pure apologetics. E.g. why is it that when TRON uses a 32-bit encoding it's called "limitlessly extensible", but when Unicode uses a 32-bit encoding it's called "inefficient"? My only beef with UTF-8 at the moment is that it needs 3 bytes per kanji, while `legacy' encodings only need 2.
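
tk's byte-count point is easy to check; a quick sketch comparing one kanji across UTF-8 and the two-byte East Asian encodings:

```python
# Byte cost of a single kanji in several encodings: UTF-8 takes
# three bytes where the 'legacy' East Asian encodings take two.
kanji = "漢"  # U+6F22
for enc in ("utf-8", "shift_jis", "euc_jp", "big5"):
    print(enc, len(kanji.encode(enc)))
# utf-8 3
# shift_jis 2
# euc_jp 2
# big5 2
```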

(I may rant more about this Unicode and kanji stuff in my diary, when I get to it.)

Besides, using Unicode fonts doesn't mean that I can only read documents encoded in UTF-*. For example, with the Bitstream Cyberbit font, I can already read web pages encoded in GB2312, Big5, EUC-JP, Shift-JIS, EUC-KR, ... The problem is that, whatever the character set, one still needs the font just to make the text intelligible (unless it's in ASCII).

Incidentally, the original TeX has been extended upon: the e-TeX mode (elatex), for example, was created when it was found that Knuth's TeX couldn't typeset right-to-left scripts.

character sets, posted 23 Jul 2004 at 03:00 UTC by mslicker » (Journeyer)

Is the TRON encoding online somewhere? I'd like to take a look at it.

Code switching seems to make more sense on the face of it. For example, this entire page is in English; if someone suddenly posted a message in Chinese, you could switch to that code.

Here is a simple format I have thought of for multilingual text (EBNF):

text   -> string*
string -> id character* null

id would be a code identifying the character set, and perhaps the size of the characters in bytes. The size would allow you to skip the string if you don't know how to process the language, but there are other ways to handle that. This seems like a nice decentralized technique for including different encodings.
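
A minimal decoder for the grammar above could look like this; the one-byte ids and their charset table are purely hypothetical, since the proposal leaves their registration open, and strings are taken to be NUL-terminated per the EBNF:

```python
# Sketch of a decoder for the proposed multilingual text format
# (EBNF: text -> string*, string -> id character* null).
CHARSETS = {0x01: "ascii", 0x02: "gb2312"}  # hypothetical id table

def decode(data: bytes) -> list[tuple[str, str]]:
    """Split a byte stream into (charset, text) runs."""
    runs = []
    i = 0
    while i < len(data):
        charset = CHARSETS[data[i]]      # one id byte per run
        end = data.index(0, i + 1)       # run ends at the NUL byte
        runs.append((charset, data[i + 1:end].decode(charset)))
        i = end + 1                      # skip the terminator
    return runs

stream = b"\x01Hello \x00\x02" + "你好".encode("gb2312") + b"\x00"
print(decode(stream))  # [('ascii', 'Hello '), ('gb2312', '你好')]
```

Note the NUL terminator only works for encodings that never emit a zero byte; a length prefix, as the post suggests, would avoid that restriction.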

Actually ArabTeX does right-to-left typesetting in PlainTeX, I don't know how it differs from e-TeX. I will look into these extensions.

Re: character sets, posted 23 Jul 2004 at 09:47 UTC by tk » (Observer)

mslicker: This is in fact the gist of ISO 2022. Current programs tend to implement only the ISO-2022-* subsets, however.

charset switching, posted 28 Jul 2004 at 00:07 UTC by Ankh » (Master)

Note that with charset switching, you can only process a file by starting at the beginning and looking at every byte. You can't, for example, use Boyer-Moore-style search algorithms, because if you skip over a codeset change mark you misinterpret part or all of the rest of the document. Another problem with switching codes arises if the target codeset isn't clearly identified. This can happen in practice with ISO 2022, so that if you mix (say) Japanese, English, Greek and Hebrew, you may need some extra mechanism to communicate the secondary character sets. Note also that 8 bits isn't enough for all of Greek if you have precomposed characters.

Although there were deep misgivings in Japan, part of that is because there were already Japanese-only solutions in place, just as the USA had US-ASCII for a long time.

Unicode isn't perfect. Standards are all about compromise, though, and Unicode+XML is the best approach we have overall for multilingual documents. The XML structure provides places to mark the intended language (and sometimes script), which allows software to disambiguate unified characters. On the subject of surrogates, Unicode is, strictly speaking, a 32-bit character set, not a 16-bit one with surrogates. The surrogates happened because (as I understand it) some major implementations had incorrectly used 16-bit characters.

TeX does mediocre typesetting compared to the best of the high-end packages or the best hand-work, but it's well above average for student or academic work. Of course, it's also leagues beyond most or all the Web browsers I've seen :-)


charset switching, posted 29 Jul 2004 at 02:53 UTC by mslicker » (Journeyer)

Any variable-length character encoding will encounter that particular problem with Boyer-Moore search, and that includes UTF-8. If random access is an important property, an index can be created; mostly, text is processed in sequence. The format I proposed would support Boyer-Moore searching if I made the strings counted, and it might operate faster in some cases thanks to its clearly marked sections. Why should all characters be put into one gigantic set? The input methods are different for each language, the fonts are designed separately, and from the perspective of information theory, characters occur largely in monolingual sequences.

When a group decides to proactively create standards, they take on the role of designers. If standardization is design, then we need competing designs and objective measures, instead of assigning authority to a group of people who may or may not be capable designers. Compromise is not a good excuse when these standards are so recent and have minimal backward compatibility. Protocol designers should strive for even greater excellence, since their work affects the most people.

If Unicode needs XML (or another markup) to distinguish characters, then it cannot be considered a complete solution on its own. That seems to me a major failure for something that claims to provide a "unique number for every character, [...] no matter what the language".

Knuth has fooled me with his TeX system: I actually own several books typeset in TeX, and I did not realize this until I started checking the covers. There have been microtypographic extensions to TeX (pdfTeX), including hanging punctuation. I wonder if the returns are worth the investment for some of the extensions. I would like to leave that decision, of what resources are devoted to laying out the page, to the author. The difficult part is choosing the right primitives, given that the layout is already optimized at the paragraph level or perhaps higher.

I like that you bring up the quality of TeX, maybe that illustrates the difference that a trained eye can make in what otherwise could be considered acceptable, even high quality, typography. I think the same kinds of differences exist in computing, perhaps the differences are far larger between the best work and what is often considered passable today.

Re: charset switching, posted 29 Jul 2004 at 05:14 UTC by tk » (Observer)

mslicker: UTF-8 happens to be immune to the problem of misinterpretation. In a multi-byte sequence, the first byte is always 0xC0 or above, while the second and subsequent bytes are always in the separate range 0x80-0xBF.

Big5 is also quite close: the first byte always has the high bit set, while the second byte may or may not.

This isn't true of, say, GB2312, which uses 2 bytes per Chinese character where both bytes have the high bit set. If the machine starts interpreting from the second byte of a pair, the entire passage becomes hash.
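This self-synchronisation property can be checked in a few lines of Python. The sample text and the `errors="replace"` handling are my choices for illustration, not part of the argument above.

```python
text = "汉字文本"  # "Chinese-character text", encodable in GB2312

utf8 = text.encode("utf-8")
gb = text.encode("gb2312")

# Start the UTF-8 stream one byte late: the decoder sees lone
# continuation bytes (0x80-0xBF), marks them as errors, and
# resynchronises at the next lead byte, recovering the tail intact.
recovered = utf8[1:].decode("utf-8", errors="replace")

# Start the GB2312 stream one byte late: the trailing byte of one
# character pairs with the leading byte of the next, so the rest of
# the passage decodes to different characters with no error raised.
scrambled = gb[1:].decode("gb2312", errors="replace")
```

Here `recovered` still ends with the intact tail of the original text, while `scrambled` typically bears no resemblance to it.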

But I wonder whether the ability to search randomly is so important when deciding on a format for external data exchange, or even whether a format's suitability for Boyer-Moore search matters from within a text editor. Even for English text, if only paragraph breaks are marked but not line breaks, a program must pretty much scan through arbitrary amounts of text to decide how to format a section of it on screen.

Why flog this particular dead horse?!?, posted 21 Sep 2004 at 23:17 UTC by MartySchrader » (Journeyer)


Please, folks -- not more concentration of effort in the area of presentation to the detriment of content. I am in complete agreement with those who stated above that the look of a particular font is much less important than the overall presentation of information. Downloadable fonts or whatever is just taking it a bit too far.

I use Netscrape Nagigator 7.1, Mozilla Firefox 0.8, Microschlock Internet Extorter 6.0, and Oprah! 7.1 to test my web pages. (Sorry, no Konqueror or Safari -- yet. Stay tuned.) IE, Opera, and the two Gecko viewers render things radically differently. There are differences between Navigator and Firefox as well. Am I to throw yet another variable into the mix to see how that is going to look across browsers? Why, for goodness sakes?!?

I do not wish to dismiss the discussion of font control with a cavalier wave of the hand. Nor do I wish to become bogged down examining the technical virtues of Yet Another HTML Extension. Can't we focus more on the overall page presentation aspects of CSS and worry about font esoterica later? Hey, let's figure out how to make pages without using the table entity, then I'll talk about fonts all day long.
