Welcome to D-Type
Font Engine And Scalable Graphics Library

Extensions And Tools
Text Layout
A Simple Extension For Complex Scripts
Windows, Mac OS, Linux
Overview

Commonly used scripts such as Latin, Greek or Cyrillic are easy to display. All you need to do is render their characters in a simple linear progression from left to right and the resulting text is correctly displayed. Unfortunately, not all of the world's scripts are that simple. Many scripts require special processing just to be displayed correctly: character reordering, contextual shaping, ligatures, positioning adjustments and so on. These scripts are known as complex scripts; Arabic, Indic and Thai are among them. Even Latin scripts often use ligatures and various types of positioning adjustments (e.g. kerning) to enhance the appearance of displayed text.

The Unicode Standard alone does not help software developers with the task of laying out text. Unicode deals with the units of textual content (characters) and provides a good solution for the computer representation, storage and interchange of text. However, Unicode does not deal with the units of textual display (glyphs) and does not provide a solution to the problem of actual text layout, shaping and advanced typography. A global, efficient and portable Unicode-based text layout/shaping engine is therefore necessary to help developers with this quite challenging task.[1]

To better understand the problems that layout/shaping engines must overcome, here are just some of the complications associated with displaying the world's various scripts:

  • Arabic and Hebrew are read from right to left. Consequently, the order of characters differs in presentation from storage. Character positioning, cursor movement and text selection in a bidirectional context (the context in which left-to-right and right-to-left text runs coexist) are typically the biggest challenges to overcome. The characters are not laid out in a simple linear progression from left to right. In other words, the logical order of characters (the order in which the user enters text as a sequence of keystrokes) can be different from the visual order (the order in which glyphs are presented to the user).

    Bidirectional Text
    In bidirectional text the trailing edge of one character is not necessarily adjacent to the leading edge of the next character. The above example shows a selection of characters that is logically contiguous but visually disjoint.

  • Arabic scripts are not only read from right to left; they also require special processing to display contextual forms properly. For example, the visual appearance of a character in Arabic scripts can change greatly depending on its position within a word and the characters that surround it. Most (but not all) characters have four different visual forms: isolated (when the character is alone), initial (beginning of a word), medial (within the word) and final (end of a word).[2] This means that layout/shaping engines must not only shape those forms properly but also detect word boundaries within a given run of text.[3]

    Contextual text shaping in Arabic scripts
    The above example shows an Arabic text sample without any special processing (in which characters are in their isolated form) and then the same text sample again with contextual shaping enabled (in which characters take their proper form depending on whether they are at the beginning, in the middle or at the end of the word).

  • With Latin, Greek, Cyrillic and even Chinese/Korean/Japanese scripts, there is often a direct one-to-one mapping between a character and its glyph. However, in Arabic, Indic and other complex scripts, several characters can combine together to create a whole new glyph. These special glyphs are then called ligatures. Although Latin scripts can also make use of ligatures, most Latin ligatures are optional and designed to improve the aesthetic appearance of certain character combinations. However, in Arabic and many other complex scripts, certain ligatures are mandatory. In those cases it is unacceptable to present certain character combinations without using the appropriate ligature.

    Ligatures in Arabic and Latin scripts
    Ligatures are not only used in complex scripts such as Arabic but sometimes in Latin scripts too. The first ligature in the above illustration is the Arabic Lam-Alef ligature which is mandatory for Arabic scripts. The remaining ligatures are some of the Latin standard and discretionary ligatures.

  • The South Asian family of scripts (Indic) exhibits rendering complications that are not found in any other script. Letters are drawn in a different order from that in which they are typed or stored in memory, glyphs are inserted or rearranged and complex ligatures are formed. The amount of pre-processing necessary to convert a series of Unicode Devanagari characters into a series of glyphs is huge. It should therefore come as no surprise that the Unicode Standard had to dedicate more than twelve pages to describing the proper processing of Devanagari characters!

    Complex glyph rearrangements and ligatures in Indic scripts
    Contextual shaping for Indic scripts must deal with complex glyph rearrangements and ligatures.

  • The difficulty with contextual shaping is that a given character, for all of its various glyph forms, usually has only one defined code point in the Unicode Standard. Similarly, ligatures often do not have a Unicode code point.[4] It is the responsibility of the layout/shaping engine to determine, at run time depending on the context, the appropriate visual form of each character in the text.
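
To make the run-time decision described in the last point more concrete, here is a minimal Python sketch that picks a contextual form for each Arabic letter from the joining behaviour of its neighbours. The joining table is a small hand-picked subset (the complete data lives in the Unicode Character Database file ArabicShaping.txt), the name `contextual_forms` is hypothetical, and diacritics and other transparent characters are ignored. It is only an illustration of the idea, not a description of how the ICU LayoutEngine is implemented.

```python
# Joining behaviour for a handful of Arabic letters. This is an illustrative
# subset chosen by hand, not the full Unicode joining data.
DUAL = "D"    # can connect to the letter before it and the letter after it
RIGHT = "R"   # can connect only to the letter before it

JOINING = {
    "\u0628": DUAL,   # BEH
    "\u062A": DUAL,   # TEH
    "\u0643": DUAL,   # KAF
    "\u0644": DUAL,   # LAM
    "\u0645": DUAL,   # MEEM
    "\u0627": RIGHT,  # ALEF
    "\u062F": RIGHT,  # DAL
    "\u0631": RIGHT,  # REH
    "\u0648": RIGHT,  # WAW
}


def contextual_forms(text):
    """Return 'isolated', 'initial', 'medial' or 'final' for each character.

    Characters in `text` are in logical order. Anything missing from JOINING
    (spaces, Latin letters, ...) is treated as non-joining.
    """
    forms = []
    for i, ch in enumerate(text):
        cur = JOINING.get(ch)
        prev = JOINING.get(text[i - 1]) if i > 0 else None
        nxt = JOINING.get(text[i + 1]) if i + 1 < len(text) else None

        # A letter attaches to the previous one only if that letter is able
        # to connect forward, which only dual-joining letters can do.
        joins_prev = cur is not None and prev == DUAL
        # A letter attaches to the next one only if it is dual-joining itself
        # and the next character is a joining letter of either kind.
        joins_next = cur == DUAL and nxt is not None

        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


# KAF TEH ALEF BEH, in logical order
print(contextual_forms("\u0643\u062A\u0627\u0628"))
```

For the word KAF, TEH, ALEF, BEH the sketch yields initial, medial, final and isolated. The last letter of the word takes its isolated form rather than its final form because ALEF does not connect to the letter that follows it, which is the subtlety described in footnote [2].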

Thanks to the underlying ICU LayoutEngine, D-Type Text Layout Extension solves all of these problems in a simple and straightforward way. All complex script rendering is done in a uniform and consistent manner. The application supplies the Text Layout Extension with an array of Unicode character codes in reading (logical) order. The extension returns an array of glyphs in the correct visual order, the coordinates necessary to position those glyphs and the character indices that map each glyph back to the input text array. These positioned glyphs can then be rendered very easily using D-Type Font Engine.
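
The shape of that result can be sketched in Python as follows. Everything here is an assumption made for illustration: `LayoutResult`, `chars_for_glyph` and `glyph_at_x` are hypothetical names rather than part of the D-Type API, the glyph indices and coordinates are made up, and the list of x positions is assumed to carry one extra entry marking the end of the last glyph.

```python
import bisect
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LayoutResult:
    """Hypothetical container for the three arrays a layout call returns."""
    glyphs: List[int]         # glyph indices, in visual order
    x_positions: List[float]  # left edge of each glyph, plus one final end position
    char_indices: List[int]   # for each glyph, index of its first input character


def chars_for_glyph(result: LayoutResult, glyph: int, text_length: int) -> range:
    """Return the range of input characters represented by one glyph.

    The range runs from the glyph's own character index up to the next
    larger character index used by any glyph, so it stays correct when
    the visual order differs from the logical order.
    """
    start = result.char_indices[glyph]
    later = [ci for ci in result.char_indices if ci > start]
    end = min(later) if later else text_length
    return range(start, end)


def glyph_at_x(result: LayoutResult, x: float) -> Optional[int]:
    """Return the visual index of the glyph under x, or None if x is outside.

    Assumes x_positions is in ascending order, as it is for a single
    horizontal run laid out from the left edge.
    """
    i = bisect.bisect_right(result.x_positions, x) - 1
    return i if 0 <= i < len(result.glyphs) else None


# Made-up result for the three characters "fin" where "f" and "i" were
# replaced by a single ligature glyph: three characters in, two glyphs out.
result = LayoutResult(glyphs=[301, 81],
                      x_positions=[0.0, 11.0, 17.0],
                      char_indices=[0, 2])

hit = glyph_at_x(result, 4.0)                    # a click inside the ligature
selected = list(chars_for_glyph(result, hit, text_length=3))
print(hit, selected)
```

A click at x = 4.0 lands on the ligature glyph, and the character indices show that this one glyph stands for input characters 0 and 1. This is the kind of lookup that cursor movement and text selection rely on.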

The benefit of this approach is that software developers do not have to be familiar with various complex scripts or any of the shaping rules that might be applicable to each script. Regardless of the script, the Text Layout Extension is always utilized in the same consistent way. It is only important to be aware of the following basic concepts:

  • The Text Layout Extension, or more precisely the underlying ICU LayoutEngine, is designed to process a sequence of Unicode characters which is in a single font, script and direction. Developers can use the Unicode bidirectional algorithm built into the Text Layout Extension to determine the direction of the text or give the user direct control over bidirectional text layout.

  • The sequence of input characters is always passed to the Text Layout Extension in reading or logical order.

  • Developers should not assume a simple one-to-one mapping between input characters and output glyphs. In other words, the size of the resulting glyph array can be (and with complex scripts usually is) different from the size of the input Unicode character array.

  • When it is necessary to map output glyphs back to the initial sequence of input characters (e.g. for cursor movement and text selection), developers should use the returned array of character indices.
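
Because each call handles text in a single direction, mixed text first has to be divided into directional runs. The Python sketch below shows a deliberately simplified way of doing that with the standard `unicodedata` module. It is not the Unicode bidirectional algorithm that the extension provides: it only looks at strong character types, and it keeps digits, spaces and punctuation in whatever run precedes them, which the real algorithm resolves more carefully. The function name `split_directional_runs` is hypothetical.

```python
import unicodedata


def split_directional_runs(text, base="L"):
    """Split logical-order text into (start, end, direction) runs.

    direction is "L" (left to right) or "R" (right to left). Only strong
    bidirectional classes start a new run; weak and neutral characters
    stay in the current run. This is a simplification for illustration.
    """
    runs = []
    current = base
    start = 0
    for i, ch in enumerate(text):
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            direction = "L"
        elif cls in ("R", "AL"):      # Hebrew and Arabic letters
            direction = "R"
        else:
            direction = current       # digits, spaces, punctuation, ...
        if direction != current:
            if i > start:
                runs.append((start, i, current))
            start = i
            current = direction
    if len(text) > start:
        runs.append((start, len(text), current))
    return runs


# Latin text, three Hebrew letters (ALEF, BET, GIMEL), then Latin text again
sample = "abc \u05D0\u05D1\u05D2 def"
for start, end, direction in split_directional_runs(sample):
    print(start, end, direction)
```

Each resulting run can then be passed separately, in logical order, together with its direction.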

As mentioned above, D-Type Text Layout Extension internally relies on the ICU LayoutEngine, a popular open-source, portable and platform-independent layout engine capable of shaping many complex Unicode scripts including Arabic, Bengali, Devanagari, Gujarati, Gurmukhi, Han, Hebrew, Kannada, Malayalam, Oriya, Tamil, Telugu and Thai. The ICU LayoutEngine uses layout tables found in font files and the knowledge of generic script shaping rules to lay out complex scripts.

The ICU LayoutEngine supports complex scripts in the following ways:

  • If the font contains OpenType tables, the LayoutEngine uses those tables.
  • If the font contains Apple Advanced Typography (AAT) tables, the LayoutEngine uses those tables.
  • For Arabic and Hebrew text, if OpenType tables are not present, the LayoutEngine uses Unicode presentation forms.
  • For Thai text, the LayoutEngine uses either the Microsoft or Apple Thai forms.
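
The presentation-form fallback for Arabic can be illustrated with Python's standard `unicodedata` module, which records for each Arabic presentation form the base letter (or letters) it stands for and whether it is an isolated, initial, medial or final form. The sketch below inverts that information into a lookup table. It only demonstrates the idea of substituting presentation forms; the function name `build_presentation_table` is hypothetical and the code is not taken from the ICU LayoutEngine.

```python
import unicodedata

FORM_TAGS = ("<isolated>", "<initial>", "<medial>", "<final>")


def build_presentation_table():
    """Map (base characters, form name) to an Arabic presentation form.

    Scans the Arabic Presentation Forms-B block (U+FE70 to U+FEFF) and
    reads each code point's compatibility decomposition, for example
    "<initial> 0628" for the initial form of BEH.
    """
    table = {}
    for cp in range(0xFE70, 0xFF00):
        decomposition = unicodedata.decomposition(chr(cp))
        if not decomposition:
            continue                      # unassigned or not decomposable
        tag, *codes = decomposition.split()
        if tag in FORM_TAGS:
            base = "".join(chr(int(code, 16)) for code in codes)
            table[(base, tag.strip("<>"))] = chr(cp)
    return table


table = build_presentation_table()

beh = "\u0628"
lam_alef = "\u0644\u0627"

# One Unicode letter, several presentation forms
for form in ("isolated", "initial", "medial", "final"):
    print(form, "U+%04X" % ord(table[(beh, form)]))

# The mandatory Lam-Alef ligature is also available as a presentation form
print("lam-alef", "U+%04X" % ord(table[(lam_alef, "isolated")]))
```

Compatibility normalization goes the other way: `unicodedata.normalize("NFKC", ...)` maps a presentation form back to the ordinary letter or letters it represents.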

For more information about the ICU LayoutEngine, please visit the ICU LayoutEngine web site and take a look at the applicable documentation.

The ICU LayoutEngine itself, however, does not provide an interface to access the necessary layout tables in the font files. Depending on how the fonts are accessed, this interface must be written by the client (developer). In other words, the developer is responsible for opening, closing and managing the actual fonts (e.g. from file or memory), accessing and, optionally, caching their layout tables and supplying those tables to the ICU LayoutEngine when requested. In the past, this was the only way for software developers to use the ICU LayoutEngine in conjunction with D-Type Font Engine.
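
To give an idea of what such a font access interface involves, here is a minimal Python sketch that looks up tables by tag in a font held in memory and caches the ones that have been requested. The table directory layout it reads (a 12-byte header followed by 16-byte records of tag, checksum, offset and length) comes from the TrueType/OpenType file format. The class name `FontTableCache` and its method are hypothetical and unrelated to the actual ICU or D-Type interfaces, and the sketch does not handle font collections or verify checksums.

```python
import struct


class FontTableCache:
    """Serve font tables by four-character tag from an in-memory font."""

    def __init__(self, font_bytes):
        self._data = font_bytes
        self._directory = self._read_directory(font_bytes)
        self._cache = {}

    @staticmethod
    def _read_directory(data):
        # Header: sfntVersion (uint32) and numTables (uint16), followed by
        # three more uint16 fields that are not needed here. All values
        # are big-endian.
        _version, num_tables = struct.unpack_from(">IH", data, 0)
        directory = {}
        for i in range(num_tables):
            # Each record: tag (4 bytes), checksum, offset, length (uint32 each)
            tag, _checksum, offset, length = struct.unpack_from(
                ">4sIII", data, 12 + 16 * i)
            directory[tag.decode("latin-1")] = (offset, length)
        return directory

    def get_table(self, tag):
        """Return the raw bytes of a table, or None if the font lacks it."""
        if tag not in self._cache:
            entry = self._directory.get(tag)
            if entry is None:
                self._cache[tag] = None
            else:
                offset, length = entry
                self._cache[tag] = self._data[offset:offset + length]
        return self._cache[tag]


def make_test_font():
    """Build a tiny synthetic font with a single 4-byte 'GSUB' table."""
    header = struct.pack(">IHHHH", 0x00010000, 1, 16, 0, 0)
    record = struct.pack(">4sIII", b"GSUB", 0, 28, 4)  # table starts at byte 28
    return header + record + b"\x00\x01\x00\x00"


font = FontTableCache(make_test_font())
print(font.get_table("GSUB"))   # the 4 bytes of the synthetic table
print(font.get_table("GPOS"))   # this font has no such table
```

The second request for the same tag is answered from the cache without touching the directory again, which matters when the same layout tables are consulted for every run of text.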

With D-Type Text Layout Extension, fortunately, this is no longer necessary. D-Type Text Layout Extension takes care of all the font specific tasks and interaction with the ICU LayoutEngine. Software developers can now use one simple extension to display all supported complex scripts without the need to write their own font access interfaces. D-Type Text Layout Extension is an extension of D-Type Font Engine that makes it possible to easily render complex scripts, hiding from the developer all the complexity associated with this process and the need to interface with the ICU LayoutEngine directly.

For software developers who use or plan to use D-Type rendering technology, D-Type Text Layout Extension brings the following benefits:

  • No need to access fonts. Developers don't have to manage or access the font files themselves. D-Type Text Layout Extension uses the same font IDs as D-Type Font Engine.
  • Caching of font layout tables. D-Type Text Layout Extension caches frequently used layout tables found in font files so that subsequent access to the same tables is efficient and quick.
  • Caching of layout instances. D-Type Text Layout Extension caches layout instances for various complex scripts so that the same shaping rules can be applied to different text runs quickly and efficiently.
  • Small, compact, portable. The entire D-Type Text Layout Extension, which includes the latest ICU LayoutEngine, font access interfaces and the caching sub-system, fits in less than 140 KB of machine code.[5]
  • Easy, single package solution. All you need to render the world's complex scripts is D-Type Font Engine and D-Type Text Layout Extension. Together, these two libraries act as a single library.

The most recent D-Type Text Layout Extension includes the ICU LayoutEngine that was released on September 17, 2007 (ICU 3.8 release). As new ICU releases become available, the Text Layout Extension will be updated to support the most recent version of the ICU LayoutEngine.

________________

[1] On Windows 2000 and XP, application developers can use the Win32 Text APIs or Uniscribe to display the complex scripts that are supported by Windows. One problem with this approach, however, is that the solution works only on Windows. Software developers who write cross-platform software have no way of porting their code to other platforms (e.g. Linux or Mac).

[2] In reality the situation is a little more complicated. Arabic is a cursive script in which letters in a word are often connected to each other. The initial form indicates that no letter is attached to the letter from the right (i.e. there is no attaching character before it, but there is one following the character). But the initial form does not necessarily mean that the character is at the beginning of a word; it only indicates that the character is not at the end of the word.

[3] Detecting word boundaries is not always a trivial task. Although most scripts use a space character as a word separator, there are scripts in which words appear without a space between them. Thai is probably the best example of such a script.

[4] There are exceptions, however. For historical reasons (older software did not have contextual text shaping capabilities), the Unicode Standard encodes the initial, medial, final and isolated forms of Arabic letters separately in the U+FB50 to U+FDFF and U+FE70 to U+FEFF ranges, called Arabic Presentation Forms-A and Arabic Presentation Forms-B. The use of such presentation forms is deprecated but not uncommon. For the same historical reasons, even certain Latin ligatures have a defined Unicode code point.

[5] The size of the extension varies depending on the platform. Additionally, the size is expected to grow as the size of the ICU LayoutEngine grows.


Copyright © 1996-2007 D-Type Solutions. All scalable images on this web site are rendered by D-Type Font Engine, D-Type Rasterizer and/or D-Type PowerDoc Engine. Reproduction, copying, or redistribution for commercial purposes of any materials, images or design elements of this website is prohibited without the prior written consent of D-Type Solutions. All trademarks are the property of their respective holders and are mentioned for identification purposes only.

Last updated on December 27, 2007