Design considerations of Bilingual encoding schemes
Posted By kalyan on April 20, 2008
Tamil script Code for Information Interchange (TSCII) and
other 8-bit Coded Bilingual Character Sets of ISO/IEC/ECMA
Introduction
In character set standards, the slot positions (bit combinations) of the set are divided into zones, each with specific features and restrictions. Bit combinations are written in column/row notation, so 01/11 means column 1, row 11 of the code table (hexadecimal 1B). The characters of the 8-bit code are specified as follows:
- C0 set containing control characters (also known as CL area):
A set of up to 30 control characters represented by bit combinations 00/00 to 01/15, except 00/14 and 00/15, which shall not be used.
The requirements for the C0 set are:
  - bit combinations 00/14 and 00/15 shall not be used;
  - the control character ESCAPE shall be represented by bit combination 01/11;
  - any control characters can be allocated to the other bit combinations.
- Character ESCAPE
ESCAPE is a control character, represented by bit combination 01/11, used to form escape sequences.
- Character SPACE
A graphic character represented by bit combination 02/00, having a visual representation consisting of the absence of a graphic symbol. It causes the active position to be advanced by one character position.
- G0 set (also known as GL area)
The 94 bit combinations 02/01 to 07/14 are used to represent graphic characters. All graphic characters allocated to bit combinations in the range 02/01 to 07/14 are spacing characters, that is, they cause the active position to advance by one character position. The graphic characters allocated by this standard to these 94 bit combinations are those of the standard lower ASCII set.
- Character DELETE
A character represented by bit combination 07/15. DEL was originally used to erase or obliterate an erroneous or unwanted character in punched tape. DEL may be used for media-fill or time-fill. DEL characters may be inserted into, or removed from, a data stream without affecting the information content of that stream, but such action may affect the information layout and/or the control of equipment.
- C1 set (also known as CR area)
The C1 set is available for up to 32 control characters in addition to those provided by the C0 set. It shall not include any of the control characters of the C0 set of ISO 6429.
No specific control characters are allocated to bit combinations 08/00 to 08/13 and 09/00 to 09/15 by this standard.
When the single-shift functions SS2 and SS3 are used, they shall be allocated to bit combinations 08/14 and 08/15 respectively; otherwise these bit combinations shall not be used. (Note: A C1 set that allocates only SS2 and SS3 to these bit combinations has been registered as ISO-IR No. 105.)
- G1 set (also known as GR area)
A set of up to 96 graphic characters represented by bit combinations 10/00 to 15/15.
The G1 set shall be either a 94-character set (bit combinations 10/01 to 15/14) or a 96-character set (bit combinations 10/00 to 15/15) of graphic characters. This set is available for graphic characters in addition to those provided by the G0 set.
Either a unique graphic character shall be allocated for each bit combination or the bit combination shall be declared unused.
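The zones described above can be sketched in a few lines of Python. This is only an illustration of the layout as stated in this section; the function names col_row and area are hypothetical and do not come from any of the standards.

```python
def col_row(byte: int) -> str:
    """Format a byte value in the column/row notation used above (e.g. 01/11)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in the range 0..255")
    return f"{byte >> 4:02d}/{byte & 0x0F:02d}"


def area(byte: int) -> str:
    """Name the zone of the 8-bit code table that a byte value falls in."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in the range 0..255")
    if byte <= 0x1F:
        return "C0"      # 00/00 to 01/15
    if byte == 0x20:
        return "SPACE"   # 02/00
    if byte <= 0x7E:
        return "G0"      # 02/01 to 07/14
    if byte == 0x7F:
        return "DELETE"  # 07/15
    if byte <= 0x9F:
        return "C1"      # 08/00 to 09/15
    return "G1"          # 10/00 to 15/15


if __name__ == "__main__":
    for value in (0x1B, 0x20, 0x41, 0x7F, 0x8E, 0xA0, 0xFF):
        print(f"0x{value:02X}  {col_row(value)}  {area(value)}")
```

For example, ESCAPE (hexadecimal 1B) is reported as 01/11 in the C0 area, and hexadecimal 8E (the slot reserved for SS2) as 08/14 in the C1 area.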
8-bit character set Standard References
Three ISO standards deal with 8-bit character sets:
ISO/IEC 2022:1994
Character code structure and extension techniques (fourth edition).
Description
This standard specifies a structure for 7-bit and 8-bit codes that is adopted by all such codes produced under the auspices of ISO/IEC JTC1/SC2. This is the subcommittee entrusted jointly by ISO and IEC with the development of character set coding matters. This standard also specifies means by which the correspondence between bit combinations and characters may be changed during a particular instance of information interchange. This is known as code extension. It makes use of control functions that are themselves represented by bit combinations within the original code. Cf: http://www.ewos.be/tg-cs/gis2022.htm
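As a rough illustration of code extension, the sketch below uses Python's built-in iso2022_jp codec, a 7-bit encoding that relies on ISO 2022 style escape sequences. The choice of this codec and the helper name esc_offsets are assumptions made for the example; they are not taken from the standard's text.

```python
ESC = 0x1B  # the control character ESCAPE, bit combination 01/11


def esc_offsets(data: bytes) -> list[int]:
    """Return the offsets at which an ESCAPE byte occurs in the data."""
    return [i for i, b in enumerate(data) if b == ESC]


# One ASCII letter followed by two Japanese characters.
encoded = "A日本".encode("iso2022_jp")

if __name__ == "__main__":
    # Every byte stays in the 7-bit range; the escape sequences switch the
    # meaning of the following bit combinations and then switch it back.
    print(encoded)
    print(esc_offsets(encoded))
```

The same bit combinations stand for different characters before and after each escape sequence, which is the change of correspondence that the standard calls code extension.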
ISO/IEC 4873:1991
ISO 8-bit code for information interchange – Structure and rules for implementation (third edition).
Description
This standard specifies a structure for 8-bit codes that builds on the general structure for such codes laid down in ISO/IEC 2022. In particular the content of the GL area of the code table is fully specified and the content of the GR area is restricted to be a character set that makes use of single-byte coding (and so contains at most 96 characters). The fixed content for the GL area is the set registered in the ISO 2375 Register as ISO-IR 6. This set is also the International Reference Version (IRV) of ISO/IEC 646:1991 and is more commonly known as the ASCII character set.
Cf: http://www.ewos.be/tg-cs/gis4873.htm
ECMA-43:1991 (identical to ISO/IEC 4873:1991) specifies three nested levels of implementation:
- Level-1, comprising the following facilities:
  - a C0 set;
  - the character SPACE represented by bit combination 02/00;
  - the G0 set;
  - the character DELETE represented by bit combination 07/15;
  - a C1 set; and
  - a G1 set.
At Level-1, no shift functions shall be used, and the G0 and G1 sets are assumed to be invoked permanently in columns 02 to 07 and 10 to 15, respectively.
At Level-1, the C1 set and/or the G1 set may be empty if there is no requirement for control characters in addition to those provided by the C0 set and/or for graphic characters in addition to those provided by the G0 set.
At Level-1 a version shall not include a G2 or G3 set.
(A G2 set consists of a 94- or 96-character set of graphic characters at bit combinations 10/00 to 15/15; the characters of the G2 set are invoked either by the single-shift function SS2 or by the locking-shift function LS2R.
A G3 set consists of a 94- or 96-character set of graphic characters at bit combinations 10/00 to 15/15; the characters of the G3 set are invoked either by the single-shift function SS3 or by the locking-shift function LS3R.)
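Level-2 and Level-3 versions of the 4873 standard extend Level-1: Level-2 adds a G2 and/or G3 set invoked by the single-shift functions SS2 and SS3, and Level-3 additionally allows the locking-shift functions such as LS2R and LS3R.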
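The Level-1 facilities listed above can be turned into a small checker. The sketch below is a simplified reading of the rules quoted in this article, not a conformance test; the function name level1_problems and the two-dictionary representation of a code table are hypothetical.

```python
def level1_problems(controls: dict[int, str], graphics: dict[int, str]) -> list[str]:
    """List violations of the Level-1 rules quoted in this article.

    controls maps a byte value to a control function name (e.g. "ESC").
    graphics maps a byte value to the graphic character allocated to it.
    """
    problems = []
    used = set(controls) | set(graphics)

    # 00/14 and 00/15 shall not be used; 08/14 and 08/15 are only for
    # SS2/SS3, and Level-1 uses no shift functions.
    for b in (0x0E, 0x0F, 0x8E, 0x8F):
        if b in used:
            problems.append(f"0x{b:02X} shall not be used at Level-1")

    if controls.get(0x1B) != "ESC":
        problems.append("ESCAPE must be at 01/11 (0x1B)")
    if graphics.get(0x20) != " ":
        problems.append("SPACE must be at 02/00 (0x20)")
    if controls.get(0x7F) != "DEL":
        problems.append("DELETE must be at 07/15 (0x7F)")

    # The G0 set (02/01 to 07/14) is fixed to the ASCII graphic characters.
    if any(graphics.get(b) != chr(b) for b in range(0x21, 0x7F)):
        problems.append("G0 must contain the ASCII graphic characters")

    return problems


if __name__ == "__main__":
    controls = {0x1B: "ESC", 0x7F: "DEL"}
    graphics = {b: chr(b) for b in range(0x20, 0x7F)}
    print(level1_problems(controls, graphics))  # an empty C1 and G1 is allowed
```

The checker deliberately says nothing about which characters occupy the C1 and G1 areas beyond the two single-shift slots, since at Level-1 both sets may be empty.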
ISO/IEC 10367:1991
Standardized coded graphic character sets for use in 8-bit codes (first edition).
Description:
ISO/IEC 10367 specifies a collection of coded graphic character sets suitable for use within the structure of an 8-bit code as laid down in ISO/IEC 4873. These sets are all suitable for use as any of the code elements G1, G2 and G3 in a version of ISO/IEC 4873 at any of its three levels of implementation. The G0 code element of ISO/IEC 4873 is prescribed by that standard but is repeated for information in ISO/IEC 10367.
ISO/IEC 10367 does not specify the sets C0 and C1 of control functions that may be used in a version of ISO/IEC 4873 that conforms to ISO/IEC 10367.
cf: http://www.ewos.be/tg-cs/gis10367.htm
Websites of Standardisation Agencies
- European Computer Manufacturers Association, Geneva, Switzerland
- International Organization for Standardization (ISO), Geneva, Switzerland
- Website of Working Group 3 of Sub-committee SC2 of Joint Technical Committee 1 of ISO, dealing with 7-bit and 8-bit coded character sets
Tamil Script Code for Information Interchange (TSCII) as an 8-bit Coded Character Set
The 8-bit bilingual, glyph-encoding-based Tamil Standard Code TSCII, proposed by the Internet Working Group for Tamil Standard Code, meets all the requirements for registration under the International Standards Organisation (ISO) standard ISO/IEC 4873:1991 Level 1 specifications, as indicated below.
Note: The Vietnamese Standard Code VISCII and the Russian language code KOI8-R are examples of officially recognized 8-bit character sets that are very similar in structure to TSCII.
The ISO/IEC guidelines for 8-bit coded character sets, ISO/IEC 4873:1991 and ECMA-43:1991, divide the entire block of 256 glyph slots into four segments: C0 (Control-0), G0 (Graphic-0), C1 (Control-1) and G1 (Graphic-1), with explicit specifications on what can be in each of these four segments. Figure 1 shows graphically the typical composition of an 8-bit coded character set.

A brief overview of 8-bit character sets that have graphic characters in the C1 block
In the last decade, the ISO-8859-1 (also known as Latin-1) character set has been the most popular and widely used character set, and it found a larger user base when HTML 3.0 adopted Latin-1 as the default for HTML documents distributed on the Internet. The Latin-1 set does not have any graphic characters placed in the C1 block. Much of the software written for the English-speaking and European markets takes the Latin-1 character set as its standard. This has led many to believe that one cannot place graphic characters in the C1 block.
The following are examples of 8-bit character sets of major computer manufacturers that have graphic characters in the C1 block.
MS-DOS Code pages
- CP437 (DOSLatinUS), originally used by the IBM Personal Computer
- CP852 (DOSLatin2) for European Languages
- CP855 (DOSCyrillic) for Cyrillic
MS-Windows Code Pages
- CP1252 (WinLatin1, also known as Windows-1252), the Microsoft character set for the Windows OS, which superseded the CP437 set used earlier by IBM PCs; a superset of the ISO 8859-1 scheme

- CP1250 (WinLatin2)
- CP1251 (WinCyrillic, also known as Windows-1251)
Apple
- MacRoman encoding

NeXT
- NeXTSTEP
Hewlett-Packard
- HP-Roman8
KOI8-R for Russian (Cyrillic) and VISCII for Vietnamese are recent examples of 8-bit character sets with graphic characters in the C1 block that have been recognized as International standards through RFCs.
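The contrast between Latin-1 and the sets listed above can be checked with the codecs that ship with Python. The sketch below counts how many of the 32 slots in the C1 block (hexadecimal 80 to 9F) decode to something other than a control character. The function name graphic_count_in_c1 is hypothetical, and treating Unicode category "Cc" as the test for a control character is an assumption of this example.

```python
import unicodedata


def graphic_count_in_c1(codec: str) -> int:
    """Count the C1-block slots (0x80..0x9F) that hold a non-control character."""
    count = 0
    for b in range(0x80, 0xA0):
        try:
            ch = bytes([b]).decode(codec)
        except UnicodeDecodeError:
            continue  # the codec leaves this slot undefined
        if unicodedata.category(ch) != "Cc":
            count += 1
    return count


if __name__ == "__main__":
    # Codec names as spelled in Python's standard library.
    for name in ("latin_1", "cp437", "cp1252", "mac_roman", "koi8_r"):
        print(name, graphic_count_in_c1(name))
```

With Python's codec tables, Latin-1 reports no graphic characters in this block, while CP437 and KOI8-R fill every slot.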
Russian /Cyrillic character code set KOI8-R
URL for KOI8-R homepage: http://www.nagual.pp.ru/~ache/koi8/main.html
The Russian section of the Internet (the relcom.* newsgroups) has been using KOI8-R as its character encoding for discussions in Cyrillic. In view of its wide popularity, Andrei Chernov et al. formalised its registration as an international standard by registering the KOI8-R character set through RFC 1489. This procedure led to the establishment of KOI8-R as the de facto standard on the Internet. KOI8-R was later also assigned the code page number 878 (CP878).
The following gif shows the content of the 128-255 slot assignments of the 8-bit character set KOI8-R:

Vietnamese Character set VISCII
URL for VISCII homepage: http://www.vietstd.org/vietstd/index.htm
VISCII was developed in 1993 by the Vietnamese Standardization Working Group Viet-Std@Haydn.Stanford.EDU.
VISCII became an international standard with registration of its character set through RFC 1456.
The following gif shows the content of the 8-bit character set VISCII:

