What is UNICODE?

Unicode (UCS-2, ISO/IEC 10646) was originally designed as a 16-bit character encoding that contains all of the characters (2^16 = 65,536 different characters in total) in common use in the world's major languages, including Vietnamese. Later versions of the standard also define code points beyond this 16-bit range, up to U+10FFFF. The Universal Character Set provides an unambiguous representation of text across a range of scripts, languages and platforms. It provides a unique number, called a code point (or scalar value), for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode Standard is modeled on the ASCII character set. Since ASCII's 7-bit character size is inadequate to handle multilingual text, the Unicode Consortium adopted a 16-bit architecture which extends the benefits of ASCII to multilingual text.
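As a small illustration of the "unique number for every character" idea, the following Python sketch prints the code point of a few sample characters in the usual U+XXXX notation. The helper name code_point_label and the choice of sample characters are assumptions made for this example, not part of the Unicode Standard.

```python
def code_point_label(ch: str) -> str:
    """Return the code point of a single character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

# Sample characters written as escapes so the source stays pure ASCII:
# Latin A, e with acute, Vietnamese e with circumflex and dot below, a CJK ideograph.
samples = ["A", "\u00e9", "\u1ec7", "\u4e2d"]
labels = [code_point_label(ch) for ch in samples]

for ch, label in zip(samples, labels):
    print(ch, label)
```

The same number identifies the character on every platform; only the byte-level encoding (UCS-2, UTF-8 and so on) differs.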

In the 16-bit UCS-2 form, Unicode characters are consistently 16 bits wide, regardless of language, so no escape sequence or control code is required to specify any character in any language. Unicode character encoding treats symbols, alphabetic characters, and ideographic characters identically, so that they can be used simultaneously and with equal facility. Computer programs that use Unicode character encoding to represent characters but do not display or print text can (for the most part) remain unaltered when new scripts or characters are introduced.

The Unicode Standard has been adopted by such industry leaders as Apple, HP, IBM, Microsoft, Oracle, SAP, Sun, Sybase, Unisys, and many others. Unicode is required by modern standards such as XML, Java, .NET, ECMAScript (JavaScript), LDAP, CORBA 3.0, and WML, and is the official way to implement ISO/IEC 10646. It is supported in many operating systems, all modern browsers, and many other products. The emergence of the Unicode Standard, together with the availability of tools supporting it, offers significant cost savings over the use of legacy character sets. It also allows data to be transported through many different systems without corruption.

At present, a number of countries, such as China, Korea, and Japan, have adopted Unicode as their national standard, sometimes after adding annexes with cross-references to older national standards and specifications of various national implementation subsets.

In September 2001, Vietnam's Ministry of Science, Technology and Environment (MOSTE) issued the TCVN 6909:2001 standard, which is based on ISO/IEC 10646 and Unicode 3.1, as the new national standard for Vietnamese 16-bit character encoding.

What is UTF-8?

The Unicode Standard (ISO/IEC 10646) defines a universal character set, originally 16 bits wide, which encompasses most of the world's writing systems. 16-bit characters, however, are not compatible with many current applications and protocols that assume 8-bit characters (such as the Web) or even 7-bit characters (such as mail), and this has led to the development of several UCS transformation formats (UTF), each with different characteristics. One of these, UTF-8, is a byte-oriented encoding designed for ease of use with existing ASCII-based systems. It serializes each Unicode code point as a unique sequence of one to four bytes, and ASCII characters keep their familiar single-byte values. This allows Unicode to be used in a convenient and backwards-compatible way in environments that, like Unix, were designed entirely around ASCII.
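To make the "one to four bytes" rule concrete, here is a minimal Python sketch of a UTF-8 encoder for a single code point. The function name utf8_encode is hypothetical and exists only for this illustration; in practice Python's built-in str.encode("utf-8") does the same job, and the sketch checks itself against it.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch only)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:
        # 1 byte: 0xxxxxxx (identical to ASCII)
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

# Compare against the built-in encoder for one sample per length class.
for cp in (0x41, 0xE9, 0x1EC7, 0x1F600):
    encoded = utf8_encode(cp)
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {encoded.hex(' ')} ({len(encoded)} byte(s))")
```

For example, the Vietnamese letter U+1EC7 becomes the three bytes E1 BB 87, while the ASCII letter U+0041 stays the single byte 41, which is why existing ASCII text is already valid UTF-8.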

The UTF-8 format of Unicode and ISO/IEC 10646 is the preferred default character encoding for internationalization of Internet application protocols, and it is expected to become the most common encoding on the World Wide Web. Being a byte-oriented format, it fits the Web naturally, since the Web itself is based on 8-bit protocols. UTF-8, in fact, is the only Unicode format that is commonly supported by web browsers.

