What is the Unicode Standard, Version 5.0?

What is the Unicode Standard, Version 5.0?

In Unicode's fifteen-year history, it has become the character encoding standard of choice in new applications. It's the default encoding of HTML and XML; it's the fundamental character type in programming languages such as Java, C# and JavaScript; and it's the internal character encoding in the Windows and Macintosh operating systems. Virtually all UNIX flavors include support for it, too. Unicode is to computing in the twenty-first century what ASCII was to computing in the twentieth century.

In October 2006, the Unicode Consortium released the newest version of the standard, Version 5.0. The new version contains a lot of characters: just under 100,000.

Two things set the Unicode standard apart from other character encoding standards. One is the sheer size and the comprehensiveness of its code assignments. Those 100,000 character assignments cover all of the characters in all of the writing systems for all the languages in common business use today, as well as all the characters needed for many minority languages and obsolete writing systems, and a whole host of mathematical, scientific and technical symbols. Whatever character you need, the chances are overwhelming that Unicode has it, and if it doesn't, no other encoding standard in reasonably wide use is going to have it, either. This comprehensiveness makes it possible to represent text in any language or combination of languages without having to worry about specifying which character encoding standard your application or document is following, and without having to worry about changing that encoding standard in the middle of your document or going without characters because you can't change encodings.
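To make that concrete, here is a minimal, purely illustrative sketch in Java (one of the languages mentioned above); the class name and the sample text are invented for this example. A single Unicode string mixes several scripts, and one encoding form carries all of it:

    import java.nio.charset.StandardCharsets;

    public class MixedScripts {
        public static void main(String[] args) {
            // One Unicode string mixing Latin, Greek, Cyrillic, Han and Arabic text;
            // no encoding switch is needed anywhere in the middle of the document.
            String mixed = "Hello, "
                    + "\u03ba\u03cc\u03c3\u03bc\u03b5, "       // Greek
                    + "\u043f\u0440\u0438\u0432\u0435\u0442, " // Cyrillic
                    + "\u4f60\u597d, "                         // Han
                    + "\u0645\u0631\u062d\u0628\u0627";        // Arabic

            // A single encoding form (UTF-8 here) represents the whole string.
            byte[] utf8 = mixed.getBytes(StandardCharsets.UTF_8);
            System.out.println(mixed);
            System.out.println("Code points: " + mixed.codePoints().count()
                    + ", UTF-8 bytes: " + utf8.length);
        }
    }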

Naturally, this comprehensiveness poses implementation challenges that have to be addressed. For example, many of the world's writing systems have complicated two-dimensional ordering properties that don't map well onto a linear progression of numeric codes, and many can be analyzed into "characters" in different ways. Different encoding decisions may need to be made for different scripts, yet you still have to be able to mix them in a document and have things work sensibly. Many characters have similar appearances, leading to potential security issues that have to be addressed. You also can't infer much about a character from its position in the code space or its appearance in the code charts. There are too many characters for that, with more being added all the time.
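As a rough illustration of that last point, the following Java sketch (class name and sample code points chosen arbitrarily for this example) asks the character property data that Java surfaces through its Character class about a few code points; nothing about their numeric values alone would tell you which are letters, which are digits, and which are invisible formatting characters:

    public class CharacterProperties {
        public static void main(String[] args) {
            // A Latin letter, an Arabic-Indic digit, a Han ideograph and the
            // invisible RIGHT-TO-LEFT MARK: their numeric values reveal none
            // of this, so software consults the character property data.
            int[] samples = { 0x0041, 0x0660, 0x4E00, 0x200F };
            for (int cp : samples) {
                System.out.printf("U+%04X %-30s letter=%-5b digit=%b%n",
                        cp, Character.getName(cp),
                        Character.isLetter(cp), Character.isDigit(cp));
            }
        }
    }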

Because of these and many other issues, the Unicode standard and its accompanying Unicode Standard Annexes ("UAXes" for short) go far beyond any other character encoding standard in describing just how those 100,000 character assignments get used together to represent real text and how software should carry out various processes on the characters. For example, since you can't infer things from a character's position in the encoding space, the standard includes a very large database of character properties that lays out in tremendous detail such things as whether a character is a letter or a digit, which other character (if any) it's equivalent to, and so on. Because there are more character codes than can be represented in a single 16-bit word, the standard defines different representation schemes (Unicode calls them "encoding forms") that optimize for different situations. Because Unicode allows many characters to be represented in more than one way, the standard defines processes for dealing with the equivalences. Many other complexities and challenges are also addressed in the standard.
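The sketch below, again in Java and again only illustrative, touches two of those mechanisms: canonical equivalence, handled here by normalizing both spellings of the same word with java.text.Normalizer, and the encoding forms, by serializing the same string as UTF-8 and UTF-16:

    import java.nio.charset.StandardCharsets;
    import java.text.Normalizer;

    public class EquivalenceAndForms {
        public static void main(String[] args) {
            // "cafe" with an acute accent, written two canonically equivalent ways:
            // a precomposed U+00E9, or "e" followed by U+0301 COMBINING ACUTE ACCENT.
            String precomposed = "caf\u00E9";
            String decomposed  = "cafe\u0301";

            // Different code point sequences, so a raw comparison says "not equal"...
            System.out.println(precomposed.equals(decomposed));   // false
            // ...but normalizing both to NFC makes the equivalence visible.
            System.out.println(Normalizer.normalize(precomposed, Normalizer.Form.NFC)
                    .equals(Normalizer.normalize(decomposed, Normalizer.Form.NFC))); // true

            // The same text in two encoding forms; the UTF-16 serialization here
            // also includes a byte order mark.
            System.out.println("UTF-8 bytes:  " + precomposed.getBytes(StandardCharsets.UTF_8).length);
            System.out.println("UTF-16 bytes: " + precomposed.getBytes(StandardCharsets.UTF_16).length);
        }
    }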

