Languages: Czech, Croatian, Danish, Dutch, English (International), English (U.S.), Finnish, French, German, Greek, Hungarian (hyphenation only), Italian, Norwegian, Polish, Brazilian Portuguese, European Portuguese, Russian, Slovak, Spanish, Swedish, Swiss-German, Turkish
ISO 639 is a standardized nomenclature used to classify languages. Each language is assigned a lowercase two-letter (ISO 639-1) and three-letter (ISO 639-2 and 639-3) abbreviation, with further codes added in later versions of the nomenclature. The system is highly useful to linguists and ethnographers, both for categorizing the languages spoken in a region and for computational analysis in the field of lexicostatistics. ISO 639 comprises five code lists.
Unicode (UCS-2, ISO 10646) is a 16-bit character encoding that contains all of the characters (2^16 = 65,536 possible characters in total) in common use in the world's major languages, including Vietnamese. The Universal Character Set provides an unambiguous representation of text across a range of scripts, languages and platforms. It provides a unique number, called a code point (or scalar value), for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode standard is modeled on the ASCII character set. Since ASCII's 7-bit character size is inadequate for multilingual text, the Unicode Consortium adopted a 16-bit architecture that extends the benefits of ASCII to multilingual text.
Unicode characters are consistently 16 bits wide, regardless of language, so no escape sequence or control code is required to specify any character in any language. Unicode character encoding treats symbols, alphabetic characters, and ideographic characters identically, so that they can be used simultaneously and with equal facility. Computer programs that use Unicode character encoding to represent characters but do not display or print text can (for the most part) remain unaltered when new scripts or characters are introduced.
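The idea that every character has a unique, language-independent code point is easy to see in practice. As a minimal sketch in Python (the sample characters are arbitrary, chosen to include a Vietnamese letter):

```python
# Every character maps to exactly one code point, regardless of
# language or platform; ord() returns it, chr() inverts it.
for ch in ["A", "é", "ơ"]:  # Latin, accented Latin, Vietnamese
    print(f"{ch!r} -> U+{ord(ch):04X}")

# The mapping is reversible: the code point alone identifies the character.
assert chr(0x1A1) == "ơ"
```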
At present, a number of countries, like China, Korea, and Japan, have adopted Unicode as their national standards, sometimes after adding additional annexes with cross-references to older national standards and specifications of various national implementation subsets.
In September 2001, Vietnam's Ministry of Science, Technology and Environment (MOSTE) issued the TCVN 6909:2001 standard, which is based on ISO/IEC 10646 and Unicode 3.1, as the new national standard for Vietnamese 16-bit character encoding.
What is UTF-8?
The Unicode Standard (ISO 10646) defines a 16-bit universal character set which encompasses most of the world's writing systems. 16-bit characters, however, are not compatible with many current applications and protocols that assume 8-bit characters (such as the Web) or even 7-bit characters (such as mail), and this has led to the development of several so-called UCS transformation formats (UTFs), each with different characteristics. Unicode provides a byte-oriented encoding called UTF-8 that was designed for ease of use with existing ASCII-based systems. UTF-8 is the Unicode Transformation Format that serializes a Unicode code point as a unique sequence of one to four bytes. The UTF-8 encoding allows Unicode to be used in a convenient, backwards-compatible way in environments that, like Unix, were designed entirely around ASCII.
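The one-to-four-byte serialization is easy to observe directly. A quick Python sketch (the sample characters are arbitrary, picked to cover each byte length):

```python
# UTF-8 serializes each code point as one to four bytes. ASCII
# characters keep their original one-byte encoding, which is why
# plain ASCII text is already valid UTF-8.
for ch in ["A", "é", "日", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")
```

Note that the ASCII character "A" encodes to the single byte 0x41, unchanged from its ASCII value; this is the backwards compatibility the text describes.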
The Unicode UTF-8 format of ISO 10646 is the preferred default character encoding for internationalization of Internet application protocols, and it is likely to be most common on the World Wide Web. Being a byte-oriented, multi-byte format, it is a natural fit for the web, which is itself built on 8-bit protocols. UTF-8 is, in fact, the only Unicode format that is commonly supported by web browsers.
In October 2006, the Unicode Consortium released the newest version of the standard, Version 5.0, which contains just under 100,000 characters.
Two things set the Unicode standard apart from other character encoding standards. One is the sheer size and comprehensiveness of its code assignments. Those 100,000 character assignments cover all of the characters in all of the writing systems for all the languages in common business use today, as well as all the characters needed for many minority languages and obsolete writing systems, and a whole host of mathematical, scientific and technical symbols. Whatever character you need, the chances are overwhelming that Unicode has it, and if it doesn't, no other encoding standard in reasonably wide use is going to have it, either. This comprehensiveness makes it possible to represent text in any language or combination of languages without having to worry about specifying which character encoding standard your application or document is following, and without having to worry about changing that encoding standard in the middle of your document or going without characters because you can't change encoding.
Naturally, this comprehensiveness poses implementation challenges that have to be addressed. For example, many of the world's writing systems have complicated two-dimensional ordering properties that don't map well onto a linear progression of numeric codes, and many can be analyzed into "characters" in different ways. Different encoding decisions may need to be made for different scripts, yet you still have to be able to mix them in a document and have things work sensibly. Many characters have similar appearances, leading to potential security issues that have to be addressed. You also can't infer much about a character from its position in the code space or its appearance in the code charts. There are too many characters for that, with more being added all the time.
Because of these and many other issues, the Unicode standard and its accompanying Unicode Standard Annexes ("UAXes" for short) go far beyond any other character encoding standard in describing just how those 100,000 character assignments get used together to represent real text and how software should carry out various processes on the characters. For example, since you can't infer things from a character's position in the encoding space, the standard includes a very large database of character properties that lay out in tremendous detail such things as whether a character is a letter or a digit, which other character (if any) it's equivalent to and so on. Because there are more character codes than can be represented in a single 16-bit word, the standard defines different representation schemes (Unicode calls them "encoding forms") that optimize for different situations. Because Unicode allows many characters to be represented in more than one way, the standard defines processes for dealing with the equivalences. Many other complexities and challenges are also addressed in the standard.
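Both of the mechanisms mentioned above, the character property database and the handling of equivalent representations, are directly observable through Python's `unicodedata` module, which exposes the Unicode Character Database. A brief sketch:

```python
import unicodedata

# "é" can be represented two ways: as the precomposed code point
# U+00E9, or as "e" followed by the combining acute accent U+0301.
precomposed = "\u00e9"
decomposed = "e\u0301"
assert precomposed != decomposed  # different code point sequences

# Normalizing to NFC (canonical composition) resolves the equivalence.
assert unicodedata.normalize("NFC", decomposed) == precomposed

# The property database answers questions like "is this a letter?"
print(unicodedata.category(precomposed))  # "Ll" = lowercase letter
print(unicodedata.name(precomposed))
```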
Any text document consists of content and layout. The document translation process aims at recreating a document in the target language that is equivalent to the source document in both content and layout. Thus, the document translation process has two main subprocesses: content translation and layout adjustment. Content translation must be performed by native speakers of the target language, and since this is apparent to most people, it generally is.
The situation is different in the case of layout adjustments. Modern translation tools are so good at extracting the translatable text portions from source documents while protecting the non-translatable formatting elements that layout adjustments may not even be needed. This is typically the case for the translation of web formats such as HTML or XML. Since web layout is rather fluid, with a large part of the actual presentation controlled by the web browser, it is generally sufficient to simply replace the source with the target text. If the goal is to produce translated print documents, however, the translated text often has to be forced into a predetermined, fixed layout. Due to time constraints, cost considerations or other logistic factors, desktop publishers often find themselves confronted with the task of touching up a document of which they are unable to read a single word.
Although one may deplore this situation as a violation of best practice, it is nevertheless common enough to warrant treatment as an integral part of the translation process. As such, it requires support material to help non-readers in their layout adjustment task.
In this issue, we will look at Japanese. The first concern a desktop publisher may have is text directionality. As many people know, Japanese books are traditionally read from right to left, in a top-to-bottom column format, but scientific and technical publications, including user manuals for hardware and software, are always written left to right in the same format as English documents. The web appears to be spreading this format still further. Thus, when English-language technical documentation is translated into Japanese, the source text should simply be replaced with Japanese, and the document layout should stay as is.
When space is tight in print documentation, it is often necessary to adjust the line breaks manually. For non-readers, Japanese text appears daunting at first glance, since words are often not separated by spaces. However, written Japanese has a number of surface features that can provide useful guidance.
First, Japanese uses punctuation marks to delimit sentences (periods), subclauses (commas) and insertions (parentheses). Thus, just as in English, it is always safe to insert a line break after a period, a comma or a closing parenthesis, or before an opening parenthesis. When foreign words are transcribed into Japanese script, word breaks are indicated either with the middle-dot character (・) or a one-byte space. Inserting a line break immediately after this dot character or the space is acceptable.
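These break rules are simple enough to sketch in code. The following Python function is an illustrative assumption, not a standard API; it encodes only the rules just described (break after sentence-ending and clause punctuation, the middle dot, or a space, and before an opening parenthesis):

```python
# Characters after which a line break is safe, per the rules above.
BREAK_AFTER = set("。、）)・ ")
# Characters before which a line break is safe.
BREAK_BEFORE = set("（(")

def safe_break_points(text: str) -> list[int]:
    """Return the string indices where a line break may be inserted."""
    points = []
    for i, ch in enumerate(text):
        if ch in BREAK_AFTER:
            points.append(i + 1)   # break goes after this character
        elif ch in BREAK_BEFORE and i > 0:
            points.append(i)       # break goes before this character
    return points

print(safe_break_points("こんにちは。テスト、です。"))
```

A real layout tool would add further rules (for example, avoiding a break that strands a single character), but the sketch shows how far the punctuation cues alone can carry a non-reader.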
The Japanese writing system uses three different sets of characters, each one for a specific purpose. Chinese characters called kanji are used to convey concepts or word meanings: they are logographic symbols. Thus, kanji carry the main meaning of Japanese texts. Kanji are fairly easy to recognize, since most of these symbols look fairly intricate. Since Japanese uses many hundreds of kanji, a complete listing is impractical.
Hiragana are symbols of Japanese origin that form a syllabary. This means that, like English letters, each symbol stands for a speech sound rather than a word meaning. However, while English letters generally represent a single sound, hiragana represent a whole syllable.
Hiragana are used to represent grammatical information; that is, they roughly correspond to English prepositions, conjunctions and similar function words. Hiragana are generally attached at the end of a word, so that they typically form a unit with the preceding kanji.
Katakana are used for transcribing foreign words and names. In some cases, such as product names or proper names, Japanese also uses Western script, and Arabic numerals are commonly used in Japanese just as in English.
Since these fairly easily distinguishable symbol sets are used for such different purposes, it is possible to make some useful generalizations for basic layout adjustments.
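Because the three symbol sets occupy distinct, well-defined ranges in Unicode, software can tell them apart mechanically. A minimal sketch using Python's `unicodedata` module (the labels and function name are illustrative assumptions):

```python
import unicodedata

def script_of(ch: str) -> str:
    """Classify a character by its Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    if "HIRAGANA" in name:
        return "hiragana"
    if "KATAKANA" in name:
        return "katakana"
    return "other"  # Western letters, digits, punctuation, etc.

print([script_of(c) for c in "漢あカA"])
```

A layout tool could use exactly this kind of classification to apply the generalizations discussed here, for example treating a kanji-plus-hiragana run as a single unbreakable unit.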