Webpages of Tamil Electronic Library © K. Kalyanasundaram



Details of the Tamil Script Code for Information Interchange
(TSCII encoding Standard for Tamil computing)


Note: This proposal outlines a Tamil font encoding standard, TSCII, that evolved through mailing list discussions on the Internet over a two-year period (1995-1997). The draft was put up for comments and evaluation by the Internet audience in late '97 and was available as a web page for a period of six months. Field tests were carried out in several areas of Tamil computing using sample font faces. The enclosed draft is the final, revised version that came out of this exercise. Recently (Nov. 1998) the present proposal, drafted by the Internet Working Group for the Tamil Standard Code (IWG-TSC), was formally presented to the Special Advisory Committee of the Tamilnadu Government for consideration and possible adoption.

A Proposal for A Tamil Standard Code For Information Interchange
(TSCII)


Tamil

Tamil is one of the two classical languages of India. It is the only language in that country which has continued to exist for over two thousand years. It is spoken today by approximately 65 million people living mainly in southern India, Sri Lanka, Singapore, Malaysia, Africa, Fiji, the West Indies, Mauritius and Reunion Islands, United Kingdom, United States and Canada. Tamil is the pre-eminent member of the Dravidian Language family and has one of the longest unbroken literary traditions of any living language in the world. [1]

Information Processing in Tamil

Dravidian languages such as Tamil are written in non-roman scripts. Hence typing text in these Indic languages on computers requires the use of specific font faces and/or word-processing software. In spite of this limitation, word-processing of Tamil text on computers has been taking place for well over a decade. Many different fonts and packages have been developed. With the availability of free Tamil fonts on the Internet during the last two years, there has been a phenomenal growth in the number of web sites dealing with matters of interest to Tamils at large. There are already a number of Tamil-language newspapers and many popular and literary magazines available on-line in Tamil script. There are web sites devoted to collections of electronic texts of Tamil literary classics, language learning etc. in Tamil script. A comprehensive listing of over 600 web sites of interest to Tamils is also available on the Internet [2]. Currently nearly all Tamil computing is at the word-processing level. We do not yet have dedicated software for other applications such as databases and multimedia.

In the absence of any organised effort to co-ordinate and promote Tamil information processing at the national and international level, many different fonts and desktop publishing packages have been developed in different parts of the world. Hardly any standard protocols were observed in the development of these key tools for Tamil information processing. This in turn has led to the present (rather unfortunate) situation in which one needs to download and install several Tamil fonts or packages to be able to access most of the materials of interest to Tamils available on the Internet. An international conference devoted to Tamil information processing, called TamilNet'97 (the first of its kind), was organised early this year in Singapore by the Internet Resources and Development Unit (IRDU) of the National University of Singapore to discuss the situation and propose possible standards. Some of the key papers presented at this conference [3] (including a broad overview of the features of the different fonts and DTP packages currently in use [4]) are available on the Internet.

Recent Efforts Towards Standardisation

Recently there have been three series of efforts, all directed towards standardisation of Tamil information processing on the Internet. Firstly, there have been a couple of national and international conferences on this topic, including the TamilNet'97 mentioned earlier. Secondly, the Tamil Nadu Government recently set up an expert committee / task force ("The Tamilnadu Computer Standardisation Committee") to examine the situation and make recommendations. This committee has made its first recommendations for Tamil keyboard layouts [5].

Thirdly, the Tamil language has been fortunate to have several email discussion lists on the Internet. Discussions in Tamil have already been taking place for about three years. Of relevance here are two such lists: one operated by Asia-Pacific Internet Company (APIC) called tamilnet [6] (webmasters@tamil.net) and one operated by the IRDU of the National University of Singapore called TamilWeb (tamilnet@tamilnews.org.sg) [7]. For over a year, Tamil enthusiasts from different parts of the world have been discussing on these email lists the urgent need for an international standard for Tamil computing. Participants in these discussions come from different walks of life (software developers, academics at the universities and ordinary end-users). Recently the mailing lists were merged into a single one. The proposal for a new international standard for Tamil information interchange discussed in this document is the outcome of these deliberations (an exchange of several hundred emails amongst several hundred participants over several months, with copies available in the archives of the above web sites).

Current Standards for Tamil

Before we elaborate on the proposed standard, it is pertinent to review the current standards for Tamil computing, if any. Officially there exist two conceptually similar encoding standards that treat Tamil within the framework of multi-lingualism: ISCII and Unicode. As elaborated below, both are "evolving" standards, not yet fully implemented in many of the widely used Internet protocols.

  • Indian Standard Code for Information Interchange (ISCII) [8]: In the early eighties, the Dept. of Electronics (DOE) of the Govt. of India appointed an expert committee to set up standards for information processing of Indic languages. The Indian Standard Code for Information Interchange (ISCII), first launched in 1984, is the outcome of this exercise. ISCII is an 8-bit / single-byte umbrella standard, defined in such a way that all Indian languages can be treated using one single character encoding scheme. ISCII is a bilingual character encoding (not glyph-based!) scheme. Roman characters and punctuation marks as defined in the standard lower ASCII take up the first half of the character set (the first 128 slots). Characters for Indic languages are allocated to the upper slots (128-255). The Indian Standard ISCII-84 was subsequently revised in 1991 (ISCII-91). ISCII was re-affirmed in 1997. Along with the character encoding scheme (ISCII), the Govt. of India also defined a keyboard layout for input called INSCRIPT. The research and development wing of the DOE, Govt. of India (called Center for Development of Advanced Computing, CDAC, based in Pune, India) has developed software packages based on these Indian standards. Its multilingual and multimedia products are based on Graphics and Intelligence-based Script Technology (GIST) (Email: gist@cdac.ernet.in). There are only a handful of implementations of ISCII by third parties available in India.

  • UNICODE [9]: Unicode is an international standard for multi-lingual word-processing, being developed by the Unicode Consortium, an Internet group that includes practically all major computer hardware and software companies of the world. Unicode is a more ambitious 16-bit / double-byte character encoding scheme with provisions for over 65000 slots to handle nearly all of the world's (50+) languages simultaneously. Along with other Indic languages, Tamil has been assigned specific slots U+0B80 to U+0BFF (2944 to 3071 in decimal; 128 locations) in this multi-lingual standard [10]. For obvious reasons, the choice of characters in Unicode for Indic languages is based on the Indian standard code ISCII. Microsoft has already implemented Unicode in its Windows 95/NT OS and even distributes a Unicode font free for multi-lingual word-processing [11]. These fonts do NOT yet include any glyphs for the Indic language segments. Apple has recently released a multi-lingual package for the Indian market based on ISCII [11], but this package does NOT yet include the glyphs corresponding to Tamil.
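
The arithmetic behind the Tamil block quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the constant and function names are our own and are not part of any standard. U+0B95 (TAMIL LETTER KA) is used as a sample character.

```python
# Bounds of the Tamil block as quoted in the text above.
TAMIL_BLOCK_START = 0x0B80
TAMIL_BLOCK_END = 0x0BFF


def tamil_block_size() -> int:
    """Number of slots in the block, both end points included."""
    return TAMIL_BLOCK_END - TAMIL_BLOCK_START + 1


def in_tamil_block(ch: str) -> bool:
    """Return True if the single character ch lies in U+0B80..U+0BFF."""
    return TAMIL_BLOCK_START <= ord(ch) <= TAMIL_BLOCK_END


print(TAMIL_BLOCK_START, TAMIL_BLOCK_END, tamil_block_size())  # 2944 3071 128
print(in_tamil_block("\u0B95"), in_tamil_block("A"))           # True False
```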

Need for the Proposed Standard for Tamil

If the ISCII and Unicode standards already exist for information interchange of Indic languages (including Tamil), a natural question is why one should propose another standard for Tamil. Listed below are some key arguments advanced in this context:

i) Being based on the "character-encoding" concept, both ISCII and Unicode leave the screen rendering of the Tamil letters to software developers. These standards are implemented through additional hardware cards (as with the GIST interface card of CDAC) or through dedicated software that invokes advanced font-handling technologies [13] such as glyph substitution, available in the still evolving font specifications like Truetype Open and Truetype GX [14]. Sophisticated font-handling techniques such as glyph substitution (GSUB) or Open Truetype fonts are available only on state-of-the-art computers running the latest versions of the OS software (as is the case with Windows and Macintosh). Consequently, usage of these two standards involves additional investment costs and a higher level of understanding of computers. A lay Tamil user is precluded from doing any simple word-processing of Tamil texts on earlier-generation computers. In view of the high costs involved, ISCII usage has been restricted largely to corporate houses, Government offices and business establishments such as banking, telecom, transport, etc.

ii) Dravidian languages are notorious for their complex glyph structures. The necessity to resort to advanced font-handling techniques such as glyph substitution puts us at a further disadvantage: we will have to wait for applications (DTP, word processing etc.) to be developed from scratch for Tamil, and we may not enjoy the luxury of using off-the-shelf applications developed for English *as-is* in Tamil. As elaborated later, the proposed TSCII lets Tamil users take advantage of the vast amount of public-domain and commercial software (for word-processing, graphics, databases,..) that already exists for English and European languages.

iii) Using the Devanagari script as the reference, ISCII defines a certain encoding scheme for all Indic languages, including Dravidian languages such as Tamil, Telugu and Malayalam. Many scholars of the Dravidian languages are highly critical of this approach. The phonology and the script usage of the Dravidian languages are very different. There are many characters in Tamil and Malayalam for which there are no Devanagari equivalents. Compromises are made by allocating extra slots to introduce these additional characters. By treating all Indian scripts under one scheme, the ISCII philosophy does not take advantage of the fact that Tamil *can* be encoded in a simple form that seamlessly integrates with existing computing platforms without requiring specialised rendering technologies.

iv) ISCII and Unicode are not the only avenues open for Tamil information interchange. It is worth pointing out that these are "evolving" standards. For several decades before their emergence, information processing and exchange in the major languages of the world has been carried out using simple, self-standing 7- and 8-bit fonts. The only problem with the existing Tamil fonts of this kind is that no standard encoding scheme has been used. So the exchange of Tamil text files is not simple, and one needs to use converters to go from one scheme to another. Web (read World-Wide-Web) based information exchange is fast growing as the rapid, cost-effective means of data exchange across the world. A standard encoding scheme for these Tamil fonts can simplify the exchange enormously. European languages, for example, have been fortunate to have several character-encoding standards defined and universally implemented.

There are several advantages to developing a Tamil standard for information interchange that is based on simple, self-standing fonts:

  • i) Once installed in the system, such fonts can be used directly in practically all applications without any extra software/hardware intervention;

  • ii) Fonts developed for one encoding scheme can easily be ported to other computer platforms (particularly between Windows, Macintosh and Unix); the process is rather straightforward. The task is so routine and simple that a growing number of fonts and Tamil learning software packages are being made available FREE on the Internet, even by amateurs.

  • iii) World-wide, FREE distribution of a self-standing Tamil font will lead to very rapid standardization of information interchange, as has been the case with most of the European languages, Russian and Japanese. Until recently (when free Tamil fonts appeared on the Internet), Tamil word-processing required the purchase of a Tamil font for at least US$50 (much more for DTP packages). No language can flourish in the emerging computer era unless the basic fonts required for routine tasks either come as part of the computer system software or are available to the user free of cost and without any restrictions.

Design Goals of the Proposed Standard

  • 1. Establish a consistent international Tamil character encoding standard that in turn leads to a self-standing Tamil font usable on all widely used computer platforms (PCs, Macintosh and Unix), particularly on earlier models and operating systems (covering at least those that appeared within the last decade).

    The Tamilnadu Government has very recently embarked on an ambitious plan to provide Internet-access booths all over the state. This will certainly increase the awareness of the utility of computers amongst lay Tamils, who will be interested in taking up Tamil computing on whatever computer they can get access to. In such a scenario, it is most likely that all early-generation computers produced in the last decade will be put to use (e.g. AT/XT PCs capable of running early versions of Windows). It will be a great disappointment to all lay Tamils if the standards require expensive, state-of-the-art computer systems.

    A Tamil font defined very much like a roman font such as Times or Helvetica, once installed in the system, can be used in all software packages supported by the respective OS without the need for additional software/hardware intervention. It is likely that over 90% of Tamil computing is in the form of simple word-processing of plain text. The encoding standard must be such that it can be readily implemented on the most widely used computer platforms (Unix, Windows and Mac). Tamil materials will be input on all three of these platforms. On the Internet, a single information exchange may involve all three OS (the sender could use a Windows PC, the recipient a Mac and the intermediate mail server a Unix-based computer)!

    Fortunately, in the last three years procedures have been developed for producing fonts with an identical encoding scheme that work under these different platforms. Information exchange via email and WWW has also been perfected to the point that no serious problems are anticipated in the rapid implementation of the proposed scheme on all three OS. The Tamilnadu Govt. is willing to undertake the task of producing one such Tamil font and distributing it free on the Internet. Free distribution of a handful of such fonts will not deprive the software market. There will always be a need for specially designed fonts for professional usage (in publishing houses), in very much the same way as the font market still exists for roman fonts (Adobe and others continue to make millions marketing roman fonts!).

  • 2. The encoding will be glyph-based, at the 8-bit bilingual level, using a unique set of glyphs and the usual lower ASCII set. Roman letters with standard punctuation marks occupy the first 128 slots (0-127) and the Tamil glyphs occupy the 'upper-ASCII' segment (slots 128-255).

    Why an 8-bit bilingual scheme?

    i) Almost all of the European languages (representing several hundred million computer users!) currently employ such an 8-bit bilingual scheme, commonly known as the ISO 8859-X schemes. Such 8-bit schemes are proven standards widely implemented on all major computer platforms. So, in terms of identification and implementation, the scheme is rather straightforward even for non-Tamil speakers.
    ii) An 8-bit scheme with the lower ASCII part in the first 128 slots can enormously facilitate the smooth flow of information across the Internet in all of the commonly used protocols (SMTP, FTP, HTTP, NNTP, POP, IMAP,..). All non-Tamil speaking personnel entrusted with communication flow (postmasters, system administrators,.. particularly those outside India and outside Tamilnadu) can easily follow the content, its originator, destination etc. and ensure its smooth exchange across the platforms and communication protocols used on the Internet.
    iii) Tamilnadu, as a constituent state of India, works under a bilingual scenario with both English and Tamil as the languages for official communications. With a single font it will be possible to correspond in either or both of the languages. The ISCII standard of the Govt. of India is also defined in a similar way.
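
    The split of the 8-bit table into a roman half and a Tamil half can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are our own, and the byte values 0xC8 and 0xC9 are arbitrary upper-half placeholders, not actual glyph slots of the proposed scheme. The point is that the roman parts of a message (headers, addresses) stay readable to anyone handling it, whatever the Tamil bytes in between.

```python
from itertools import groupby
from typing import List, Tuple


def segment(byte: int) -> str:
    """Name the half of the 8-bit table that a byte value falls in."""
    if not 0 <= byte <= 255:
        raise ValueError("not an 8-bit value")
    return "ascii" if byte < 128 else "tamil"


def split_runs(data: bytes) -> List[Tuple[str, bytes]]:
    """Split data into consecutive runs of lower-half and upper-half bytes."""
    return [(kind, bytes(run)) for kind, run in groupby(data, key=segment)]


# 0xC8 and 0xC9 are arbitrary upper-half placeholders, not real glyph slots.
message = b"To: " + bytes([0xC8, 0xC9]) + b" <user@example.org>"
for kind, run in split_runs(message):
    print(kind, run)
```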

    What is meant by a unique set of glyphs?

    Tamil has far too many letters for each to be accommodated as a single glyph in the 128 slots left. So, depending on the complexity of the character (and its rendering), the scheme may use one, two or three bytes to define a single letter. But the choice of glyphs is such that each of the 250+ Tamil letters (uyir, mei and uyirmei) is represented in one and only one way.
    In the past, the Tamil language used alternative glyphs for some of the Tamil letters (e.g. the forward kombu/kokki to write lai/Nai/nai, Raa, Naa and naa, referred to as ORNL). A unique definition scheme implies that there is no place for these old-style characters in the encoding scheme.
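
    The idea of spelling each letter with one, two or three glyph bytes, in one and only one way, can be sketched as follows. This is a hedged illustration, not the actual scheme: the glyph names and the slot numbers (200-202) are placeholders invented for this example, and we assume the glyphs are stored in the order in which they are written (so the kombu precedes the consonant in "ko").

```python
# Placeholder glyph slots in the upper half (128-255).
# These numbers are invented for illustration; they are NOT real TSCII values.
GLYPH_SLOT = {"KA": 200, "AA_SIGN": 201, "E_KOMBU": 202}

# Each letter is spelled by exactly one sequence of one, two or three glyphs.
LETTER_TO_GLYPHS = {
    "ka": ("KA",),
    "kaa": ("KA", "AA_SIGN"),
    "ko": ("E_KOMBU", "KA", "AA_SIGN"),
}


def encode_letter(letter: str) -> bytes:
    """Return the glyph bytes that spell one letter."""
    return bytes(GLYPH_SLOT[glyph] for glyph in LETTER_TO_GLYPHS[letter])


def is_unambiguous(table: dict) -> bool:
    """True if no two letters share the same glyph sequence."""
    sequences = list(table.values())
    return len(sequences) == len(set(sequences))


for letter in LETTER_TO_GLYPHS:
    print(letter, len(encode_letter(letter)), "byte(s)")
print("unambiguous:", is_unambiguous(LETTER_TO_GLYPHS))
```

    A table containing an old-style alternative form would give one letter two glyph sequences; keeping a single entry per letter, as in the dictionary above, rules that out by construction.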

    Why not character encoding as in ISCII and Unicode?

    If the glyph encoding scheme is unambiguous in defining the resulting character set, then it does not really matter whether one chooses to encode glyphs or characters. Defining a unique set of glyphs that leads to a unique definition of all of the 250+ Tamil characters makes the glyph encoding scheme unambiguous. Defining glyphs also defines the rendering of the characters. The fact that several Tamil fonts are already functioning successfully in the market is clear proof of the validity and implementability of the approach. As mentioned under (1), the glyph encoding scheme allows the design of simple, self-standing fonts.
    It was pointed out earlier that defining characters alone and leaving the rendering to the software (as in Unicode and ISCII) requires dedicated, expensive hardware and/or software. Unicode fonts and the Apple multi-lingual package (currently with Devanagari and Gurmukhi alone) can be used only on the latest generation of computers with Power PC chips and current OS software!

  • 3. The Tamil standard must be an open standard.

    Practically all of the Tamil fonts and software packages currently in use world-wide are the recent work of individual authors and hence are subject to copyright protection of some sort for the authors. The copyright protection for authors is very clear with DTP packages. But when it comes to fonts, the scope (what can be subject to copyright and what cannot) is very hazy and the protection varies from country to country. So it is desirable to develop a true international open standard. This approach will also avoid the unpleasant situation of giving an extra edge to someone by picking the encoding of his/her existing font/software as the standard.
    The proposed Tamil "encoding" scheme and the associated "Keyboard Input Options" are open standards - i.e. no one needs to seek permission or give credit to implement the standard in any application, including commercial, freeware and shareware versions. But the "implemented" software may or may not be copyrighted by the developer - this is entirely at the developer's discretion. It may be mentioned here that Unicode is an "OPEN" standard of the kind envisaged here for TSCII. But, as of date, both ISCII and the associated INSCRIPT keyboard are "proprietary" standards owned by the Govt. of India.

  • 4. The encoding scheme should be universal in scope. The Tamil standard must include all characters that are likely to be used in everyday Tamil text interchange.
    Over the centuries the Tamil language has grown with several grantha characters added on. The usage of these grantha characters along with pure Tamil ones is deep-rooted in the day-to-day usage of Tamil by the common man. Hence the inclusion of these grantha characters becomes essential under the above criterion. Both ISCII and Unicode recognize this situation and have provided specific slots for a number of grantha characters.

    Unlike many of the Tamil fonts and software packages that leave out rarely used Tamil letters (such as ngu, ngU, nyu, nyuu), the present scheme ensures their presence. This has been done so that multimedia and software for teaching Tamil can display all of the Tamil letters without exception.

  • 5. The encoding standard must be Unicode and ISCII compatible.

    What does Unicode compatibility mean in terms of glyph choices? The glyph choices are to be such that a one-to-one mapping table can be established between the letter/character definitions under the present scheme and those of Unicode / ISCII. A draft of one such mapping table is presented herewith in the annex section. Using such a table, it will be possible to save a TSC-based file in either format.

    Both the Unicode and ISCII schemes include a number of Tamil numerals. So the present scheme needs to include these Tamil numerals. Otherwise there cannot be a one-to-one correspondence between the present scheme and these forthcoming standards.
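
    How a one-to-one mapping table supports lossless conversion in both directions can be sketched as below. This is a minimal sketch under stated assumptions, not the draft table from the annex: the glyph byte values (200-202) are placeholders invented for this example, while the Unicode code points used (U+0B95 KA, U+0BBE vowel sign AA, U+0BCA vowel sign O) are real. The function names are our own.

```python
# Placeholder glyph bytes (200-202) mapped to real Unicode Tamil code points.
# The byte values are invented for illustration; they are NOT the annex table.
TSC_TO_UNICODE = {
    bytes([200]): "\u0B95",                  # ka
    bytes([200, 201]): "\u0B95\u0BBE",       # kaa = ka + vowel sign aa
    bytes([202, 200, 201]): "\u0B95\u0BCA",  # ko  = ka + vowel sign o
}
UNICODE_TO_TSC = {text: raw for raw, text in TSC_TO_UNICODE.items()}


def is_one_to_one() -> bool:
    """The inverse table loses no entries only if the mapping is one-to-one."""
    return len(UNICODE_TO_TSC) == len(TSC_TO_UNICODE)


def tsc_to_unicode(data: bytes) -> str:
    """Convert glyph bytes to Unicode text, longest glyph sequence first."""
    out, i = [], 0
    while i < len(data):
        if data[i] < 128:  # lower half: plain ASCII passes through
            out.append(chr(data[i]))
            i += 1
            continue
        for width in (3, 2, 1):
            chunk = data[i:i + width]
            if chunk in TSC_TO_UNICODE:
                out.append(TSC_TO_UNICODE[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no mapping for byte {data[i]:#04x} at offset {i}")
    return "".join(out)


def unicode_to_tsc(text: str) -> bytes:
    """Convert Unicode text back to glyph bytes, longest match first."""
    out, i = bytearray(), 0
    while i < len(text):
        if ord(text[i]) < 128:
            out.append(ord(text[i]))
            i += 1
            continue
        for width in (2, 1):
            chunk = text[i:i + width]
            if chunk in UNICODE_TO_TSC:
                out += UNICODE_TO_TSC[chunk]
                i += len(chunk)
                break
        else:
            raise ValueError(f"no mapping for U+{ord(text[i]):04X} at index {i}")
    return bytes(out)


sample = b"a " + bytes([202, 200, 201, 200])  # "a " followed by ko, ka
assert unicode_to_tsc(tsc_to_unicode(sample)) == sample
```

    Note that the stored order differs between the two forms: the glyph bytes follow the written order, whereas Unicode places the vowel sign after the consonant. The table absorbs that difference, which is why a per-letter (rather than per-byte) mapping is assumed here.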

  • Why Unicode, ISCII compatibility?

    There are major advantages in ensuring this compatibility with the "emerging" standards.

    i) It is an undeniable fact that the world is heading towards multi-lingualism. This is particularly true for a country like India, where the migration of people amongst the different constituent states is very pronounced. The encoding standards for multi-lingualism, Unicode and ISCII, are still "evolving" and are not fully established (particularly in the implementation of most of the widely used Internet information exchange protocols such as SMTP, HTTP, NNTP, POP, PDF,...). Hence it is proposed that the Tamil community start using TSCII as an "interim standard" and move on to the multi-lingual standards (either Unicode or ISCII) at a later date. A clean compatibility will ensure that all Tamil materials generated in TSCII format can be made available in Unicode/ISCII format at all times - present and future. None of the TSCII-based resources will be lost when Unicode/ISCII become fully functional.

    ii) Secondly, the present glyph encoding scheme can happily co-exist with the more sophisticated Unicode/ISCII schemes and can even make way for a smooth transition to Unicode at a future date. Indian-language packages for Unicode and ISCII are very expensive and have started appearing in the market only very recently. Fool-proof implementation of these standards is still a largely under-explored domain.




  • 6. The standard must be usable and co-exist with other existing software until Unicode compliant software becomes available.

    One-to-one correspondence tables between the character definitions of the proposed standard and those of the popular Tamil fonts/DTP packages will ensure a smooth transition and the recovery of all archived Tamil text materials produced to date. Conversion programs that allow inter-conversion of Tamil text files prepared using different font encoding schemes already exist. Such conversion programs based on the proposed Tamil standard will be made available to promote a rapid and smooth transition to the new standard.

  • 7. The Tamil encoding standard should allow rapid implementation of many of the routine tasks required in large databases (such as search or sort).

    It is very likely that, with the widespread growth of a true international standard for Tamil, large databases (library catalogues, electronic telephone directories, land/property registries, inventories of materials in departmental stores etc.) will be built based on Tamil script. Routine usage of these databases often requires search or sort routines. The encoding scheme should be such as to allow the development of software for these tasks without unnecessary demands for huge computer memory or processing capacity.
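
    One way to see why sorting deserves attention in a glyph-based scheme is the small sketch below. It is an illustration under our own assumptions, not part of the proposal: the slot numbers are invented placeholders, words are given as already-separated letters, and only four letters are considered. When a left-side vowel sign is stored before its consonant, sorting by raw bytes can disagree with dictionary order, and a small per-letter rank table is one inexpensive way to restore it.

```python
# Placeholder byte values for four letters; NOT real TSCII assignments.
LETTER_BYTES = {
    "ka": bytes([200]),
    "kaa": bytes([200, 201]),
    "nga": bytes([203]),
    "ko": bytes([210, 200, 201]),  # left-side vowel sign stored first
}

# Dictionary order of the letters used here: the ka-series precedes nga.
ALPHABET_ORDER = ["ka", "kaa", "ko", "nga"]
RANK = {letter: position for position, letter in enumerate(ALPHABET_ORDER)}


def encode(word) -> bytes:
    """Encode a word given as a sequence of letter names."""
    return b"".join(LETTER_BYTES[letter] for letter in word)


def collation_key(word) -> tuple:
    """Sort key that follows dictionary order rather than byte order."""
    return tuple(RANK[letter] for letter in word)


words = [("nga",), ("ko",), ("ka",)]
by_bytes = sorted(words, key=encode)
by_letters = sorted(words, key=collation_key)
print(by_bytes)    # [('ka',), ('nga',), ('ko',)]
print(by_letters)  # [('ka',), ('ko',), ('nga',)]
```

    The rank table has one small entry per letter, so a lookup of this kind places no heavy demand on memory; whether plain byte comparison suffices depends on the actual slot allocation.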

  • 8. The output of the Tamil standard (Tamil text) should be independent of the input mode.

    It is important to realize that, with glyph-encoding based font faces, the text input process using the keyboard is totally independent of the glyph choices that constitute the font encoding scheme. Using different keyboard editors, it is possible to use the same Tamil font face and input the text using one of several keyboard layout options: those based on the Tamil typewriter layout, phonetic, romanized or transliterated, etc. The resulting Tamil text will be identical in all these cases, its format being determined by the encoding scheme. Keyboard editors allow easy toggling between the roman and Tamil segments, and the Tamil characters can be accessed directly through the roman keyboard. For European languages based on the 8859-X encoding schemes, several keyboard editors to toggle between the standard US mode and French, German, Finnish, Swedish, etc. already come as part of the OS software. Once the proposed encoding scheme for Tamil becomes the standard, similar keyboard editors for Tamil text input can be made available as system software.

    There are several popular methods of input for Tamil and these are considered under different keyboard layouts: classical Tamil typewriter, romanized and phonetic or transliterated. Several Keyboard editors that allow input according to these different methods have already been developed and these can be readily adapted to include the proposed encoding scheme as the reference chart for the font in question.
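
    The independence of the stored text from the input method can be illustrated with a short sketch. Everything specific in it is hypothetical: the slot numbers are placeholders, and the key assignments of both layouts (in particular the "typewriter" keys h and q) are made up for the example and do not reproduce any real keyboard layout.

```python
# Placeholder encoding table; the slot numbers are NOT real TSCII values.
ENCODING = {"ka": bytes([200]), "kaa": bytes([200, 201])}

# Two hypothetical keyboard layouts: keystroke sequence -> letter name.
ROMANIZED_LAYOUT = {"ka": "ka", "kaa": "kaa"}  # type the romanized syllable
TYPEWRITER_LAYOUT = {"h": "ka", "hq": "kaa"}   # made-up key assignments


def type_text(keystrokes, layout) -> bytes:
    """Turn a list of keystroke sequences into encoded text via a layout."""
    return b"".join(ENCODING[layout[keys]] for keys in keystrokes)


via_romanized = type_text(["kaa", "ka"], ROMANIZED_LAYOUT)
via_typewriter = type_text(["hq", "h"], TYPEWRITER_LAYOUT)
print(via_romanized == via_typewriter)  # True: same bytes from either layout
```

    Only the layout dictionaries differ between the two input methods; the encoding table, and therefore the stored text, is shared.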

  • 9. As with the Unicode standard, the "proposed standard" does not encode idiosyncratic, personal, novel, rarely exchanged, or private-use characters, nor does it encode logos or graphics. Artificial entities, whose sole function is to serve transiently in the input of text, are excluded. Graphologies unrelated to text, such as musical and dance notations, are outside the scope of the proposed standard.

    One possibility would be to agree on a supplementary dingbat-type font for exclusive use within the Tamil community - one that contains symbols such as OM, religious symbols, arrows, Greek symbols etc. If all Tamil web pages use these two fonts (one official Tamil font and a second de-facto standard dingbat-style font), we can easily add some color and liveliness to the world of Tamil computing.


Part II provides a description of the encoding scheme, the rationale for glyph choices and the slot allocations.
The Annexes are carried on a separate web page.

A draft version of the proposal was first put up on the Internet on Dec. 2, 1997 and this file was last revised on October 27, 1998.
Please send your comments to Dr. K. Kalyanasundaram

