Precomposed Characters

Precomposed characters merit a separate page because they can cause much grief to font designers who deal with positioning and substitution.

Certain sequences of characters can also be represented as a single character, called a precomposed character (or composite or decomposible character). For example, the character "ü" can be encoded as the single code point U+00FC "ü" or as the base character U+0075 "u" followed by the non-spacing character U+0308 "¨". The Unicode Standard encodes precomposed characters for compatibility with established standards such as Latin 1, which includes many precomposed characters such as "ü" and "ñ". (The Unicode® Standard: A Technical Introduction. Character Sequences)
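The two encodings of "ü" described above can be compared with a minimal Python sketch using the standard unicodedata module. NFC and NFD are the Unicode Standard's names for the canonically composed and decomposed normalization forms; the variable names are illustrative only.

```python
import unicodedata

precomposed = "\u00fc"   # ü as a single code point
sequence = "u\u0308"     # u followed by the non-spacing diaeresis

# Compared code point by code point, the two strings differ ...
print(precomposed == sequence)

# ... but each normalizes to the other, so they are canonically equivalent.
print(unicodedata.normalize("NFC", sequence) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == sequence)

# The decomposition mapping stored in the Unicode character database.
print(unicodedata.decomposition(precomposed))
```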

The idea sounds harmless. A symbol like é can be written with the single character 00E9 or with the sequence e 0065 followed by combining acute accent 0301. The intent is to provide some backwards compatibility with older encodings. In some cases software needs to decompose composites for processing, such as for alphabetization, so the Unicode Standard defines decompositions for all precomposed characters. There are no plans to increase the number of precomposed characters, and new precomposed combinations of Latin letters with diacritics are very unlikely to be accepted (see Unicode FAQ. Characters, Combining Marks, 13). There is little real need for them: each addition would mean more work in decomposition, and the number of possible combinations is enormous, especially when several diacritics are stacked on one base letter.
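As a rough illustration of why software decomposes composites for alphabetization, the Python sketch below builds a simplistic sort key from the decomposed form. It is only a sketch: real collation follows language-specific rules, and the function names base_letters and sort_key are made up for this example.

```python
import unicodedata

def base_letters(text):
    """Decompose text and drop combining marks, leaving base letters only."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def sort_key(word):
    # Primary key: base letters. Secondary key: the fully decomposed form,
    # so that both spellings of "é" produce an identical key.
    return (base_letters(word), unicodedata.normalize("NFD", word))

# "été" is entered precomposed, "état" with a combining acute accent.
words = ["\u00e9t\u00e9", "etude", "e\u0301tat", "eta"]
print(sorted(words, key=sort_key))
```

Because both spellings are decomposed before comparison, the order does not depend on how the user happened to enter the accented letters.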

The trouble for font developers comes not from decomposition but from precomposition. Both Windows and Mac have built-in software (Uniscribe on Windows, Cocoa on Macintosh) which converts a sequence of characters to the corresponding precomposed character if the font has a glyph for it. That is, when the user enters e 0065 and combining acute 0301, the system displays the glyph for é 00E9. This kind of logic is useful for well-known languages or languages with few diacritics, and on systems which do not yet support font substitution and positioning. For instance, there are precomposed characters for all or nearly all European languages and for Vietnamese (for the latter see 1EA0 to 1EF9, 90 extra characters). Users of these languages would likely be content with a Unicode font containing all their characters in precomposed form.
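The behaviour just described can be modelled in a few lines of Python. This is a simplified model and not the actual logic of Uniscribe or Cocoa, which is not documented here; the function name choose_form and the idea of representing a font as a set of encoded code points are assumptions made for the example.

```python
import unicodedata

def choose_form(cluster, font_code_points):
    """Pick what a simplified shaper would render for a base-plus-mark cluster.

    Assumption: the precomposed (NFC) form is used whenever one exists and
    the font encodes it; otherwise the original sequence is kept and the
    mark must be positioned on the base glyph.
    """
    composed = unicodedata.normalize("NFC", cluster)
    if composed != cluster and all(ord(ch) in font_code_points for ch in composed):
        return composed      # the precomposed glyph wins
    return cluster           # base glyph plus separately positioned mark

font_with_eacute = {0x0065, 0x0301, 0x00E9}
font_without_eacute = {0x0065, 0x0301}

typed = "e\u0301"   # e followed by combining acute
print([hex(ord(ch)) for ch in choose_form(typed, font_with_eacute)])
print([hex(ord(ch)) for ch in choose_form(typed, font_without_eacute)])
```

The same input therefore reaches the font by two different routes, depending only on which characters the font happens to encode.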

In theory there would also be no difficulty with precomposition for languages whose diacritics have no precomposed forms at all, if any such languages exist. In those cases the system either positions all the diacritics properly or positions none. The difficulty arises with languages for which some but not all of the desired combinations of base character and diacritic exist in precomposed form. This is the case with all Yukon languages and probably with many other languages around the world written in roman script with diacritics. One example of the uneven distribution of precomposed characters involves the ogonek, a diacritic whose name means "little tail" in Polish.

In Yukon languages, and many others, the ogonek shows nasalization of a vowel. In Polish the ogonek appears only on a,A,e,E and also shows nasalization. In Lithuanian the ogonek appears on a,A,e,E,i,I,u,U. At one time it showed nasalization but now partly shows vowel length. Neither European language has ogonek on o,O. Unicode defines precomposed ogonek forms for a,A,e,E,i,I,u,U in the Latin Extended-A block. Precomposed o,O with ogonek are also encoded (01EB, 01EA), but in Latin Extended-B, and none of the fonts compared below supplies them. A font may contain all, some or none of the precomposed characters. The chart below compares the effects of precomposition with regard to the ogonek for several fonts displayed in TextEdit on Macintosh OS X 10.2. The text was entered once, and the font was changed for each screen shot.

Apple OS X 10.2 core font: Helvetica Neue
No precomposed bases with ogonek. All ogoneks centred below the base character by Cocoa.

Apple OS X 10.2 core font: Lucida Grande
Precomposed a,e,i,u,A,E,I,U. Only o,O have ogonek centred below the base by Cocoa. With o,O the ogonek is slightly lower.

Apple OS X 10.2 core font: Times
Same as Lucida Grande. The difference in ogonek position with o,O is very noticeable.

Apple OS X 10.2 core font: Herculanum
Just a,e,i,A,E,I are precomposed. Note that the shape of the ogonek in the precomposed forms differs from the shape with o,O,u,U. This font could be used for Polish, but would not be suitable for Lithuanian or Yukon languages.

Microsoft font: Lucida Sans Unicode
Just o,O are not precomposed, but the shapes and positions of ogonek with these characters are visually consistent with the others (at least on screen, at this point size, with this operating system).

YNLCserif
No precomposed bases with ogonek, like Helvetica Neue. Unlike with that font, however, the diacritic's vertical position varies, due to the logic in Cocoa. In this image the ogonek is one pixel higher with i,o,A,E,I,O than it is with a,e,u,U.
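The inventories listed in the chart can be summarized with a short Python sketch that labels each font's ogonek coverage as "all", "none" or "mixed". The letter sets are taken from the descriptions above; the ten vowels in NEEDED and the function name coverage are assumptions made for the example. A "mixed" result only flags a risk of inconsistent display, since the chart shows that a "none" font can still vary and a "mixed" font can look consistent.

```python
# Bases shown with ogonek in the chart (assumed set for this sketch).
NEEDED = set("aeiouAEIOU")

# Bases for which each font supplies a precomposed ogonek form.
PRECOMPOSED = {
    "Helvetica Neue": set(),
    "Lucida Grande": set("aeiuAEIU"),
    "Times": set("aeiuAEIU"),
    "Herculanum": set("aeiAEI"),
    "Lucida Sans Unicode": set("aeiuAEIU"),
    "YNLCserif": set(),
}

def coverage(precomposed, needed=NEEDED):
    """Classify how much of the needed inventory is precomposed in a font."""
    covered = precomposed & needed
    if not covered:
        return "none"
    if covered == needed:
        return "all"
    return "mixed"

for font, letters in PRECOMPOSED.items():
    missing = "".join(sorted(NEEDED - letters))
    print(f"{font}: {coverage(letters)} (positioned by the system: {missing})")
```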

An inconsistent inventory of precomposed signs can lead to inconsistencies in display, as shown above. But there is a second problem with this diacritic. In Polish and Lithuanian typography there are particular rules about the ogonek's shape and positioning. For instance, with a,u the ogonek is attached to the stem of the character. This contrasts with the Yukon practice of placing the diacritic under the belly of a,u.

While developing the GPOS tables in YNLCserif I was baffled by the fact that the ogoneks were properly placed when viewed in MS VOLT, but sometimes not when viewed in BabelPad or Internet Explorer. Worse, the positioning appeared to correct itself to the expected form when certain other diacritics were added to the base character. Paul Nelson from Microsoft kindly pointed out that Uniscribe defaults to the precomposed forms where it can. That is, the ogonek would appear either on the stem or under the belly of an a or u depending on whether or not Uniscribe could apply a precomposed character.

The solution chosen was to remove the Unicode encoding from the offending glyphs, but leave the glyphs in the font. The glyphs are not accessible to Uniscribe or Cocoa but remain available in case they are needed in the future. The most elegant solution would be to eliminate all the precomposed characters in the font. This would mean a dramatic decrease in font size and prevent future unwanted precomposition.
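The effect of unencoding can be pictured with a small Python sketch in which a font is reduced to a glyph list plus a map from code point to glyph name. This is not the procedure actually used on YNLCserif, which would be carried out in a font editor; the glyph names and the function name unencode are hypothetical.

```python
import unicodedata

def unencode(cmap, code_points):
    """Return a copy of a code point -> glyph name map without code_points.

    Only the encoding is dropped; the glyph list is not touched.
    """
    return {cp: name for cp, name in cmap.items() if cp not in code_points}

# Hypothetical excerpt of a font: its glyphs and its character map.
glyphs = ["a", "ogonekcomb", "aogonek", "eogonek"]
cmap = {0x0061: "a", 0x0328: "ogonekcomb", 0x0105: "aogonek", 0x0119: "eogonek"}

# Code points of a,A,e,E,i,I,u,U with ogonek, derived by composing each
# base letter with the combining ogonek 0328.
OGONEK_PRECOMPOSED = {
    ord(unicodedata.normalize("NFC", base + "\u0328")) for base in "aAeEiIuU"
}

new_cmap = unencode(cmap, OGONEK_PRECOMPOSED)
print(sorted(f"{cp:04X}" for cp in new_cmap))   # encodings that remain
print(glyphs)                                    # glyphs are all still present
```

With the precomposed code points gone from the map, a system that looks for them finds nothing and has to fall back on the base letter and the combining ogonek.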

Removing the encoding for a,A,e,E,i,I,u,U with ogonek fixed one problem with precomposition, but there are others, including some which may surface only when the font is in actual use. Another precomposition problem noticed on Windows involved the eight glyphs for e,E,o,O with macron and grave or macron and acute. Unencoding these seemed to fix all the problems noticed on Windows in the text samples used for testing. Then, on Mac, a further problem appeared. In Yukon practice a macron is used below certain base characters to show semi-voiced fricatives in Upper Tanana and Tanacross (s, sh, th, xh with "underscore"). Of the letters involved, h,s,t,x,H,S,T,X with macron below, only h,t,T have precomposed forms in Unicode (T 1E6E, t 1E6F, h 1E96). The chart below contains images from TextEdit on OS X 10.2 with 48 pt fonts showing the needed symbols. The precomposed characters can cause trouble with the size and placement of the diacritic, as indicated. Also, some placements do not match because Cocoa apparently calculates positions from the character outline rather than the baseline.
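Which of the eight letters have a precomposed form with macron below can be checked with the Python standard library. The sketch assumes the macron below is encoded as the combining character 0331; the function name precomposed_form is made up for the example.

```python
import unicodedata

MACRON_BELOW = "\u0331"   # COMBINING MACRON BELOW

def precomposed_form(base, mark=MACRON_BELOW):
    """Return the single precomposed character for base + mark, or None."""
    composed = unicodedata.normalize("NFC", base + mark)
    return composed if len(composed) == 1 else None

for base in "hstxHSTX":
    form = precomposed_form(base)
    label = f"{ord(form):04X}" if form else "no precomposed form"
    print(base, label)
```

With the Unicode data shipped in current Python versions, only h, t and T should report a code point, matching the list given above.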

Apple OS X 10.2 core font: Lucida Grande
The macron in precomposed h,t,T appears higher than the ones placed by Cocoa on s,x,H,S,X. The precomposed macrons are also wider.

Freeware font from Herman Miller: Thyromanes
The macron on precomposed h,t,T appears lower than the ones placed by Cocoa on s,x,H,S,X.

Microsoft font: Lucida Sans Unicode
No precomposed characters. All macrons placed by Cocoa. The macrons below glyphs with curved bottoms, s,t,S, are lower than those below flat-bottomed glyphs.

YNLCserif before unencoding the precomposed glyphs
Macrons are at three heights: highest under precomposed h,t,T, mid under flat-bottomed x,H,X, and lowest under round-bottomed s,S.

YNLCserif after unencoding the precomposed h,t,T with macron below
The macron under round-bottomed S is placed slightly lower by Cocoa. At 96 points, the difference in macron placement by Cocoa with s,t,S is visible.

In theory, one could create a visually coherent system involving both precomposed and non-precomposed characters in a roundabout fashion. One could tinker with the precomposed characters in font software until the diacritic shape and placement match those used by Cocoa on OS X. Then one could tinker with MS VOLT on Windows to make the combining diacritics match the ones established in the precomposed glyphs in the font. That is, the placement in Cocoa would define the placement in VOLT. Few typographers with good design sense would adopt this system. It makes much more sense to unencode or eliminate the precomposed glyphs. This allows the development of visually appealing positioning and substitution on Windows with MS VOLT. It also allows positioning in Cocoa which, although perhaps less visually appealing than the Windows version, is more internally consistent than a font with precomposed characters.

Last Updated 23 October 2003