AN OVERVIEW OF DIFFERENT TOOLS FOR WORD-PROCESSING
OF TAMIL AND A PROPOSAL TOWARDS STANDARDISATION
(part I of the paper)
Institute of Physical Chemistry, Swiss Federal Inst. of Technology,
CH-1015 Lausanne, Switzerland
(Invited Paper to be presented at the "International Symposium for Tamil Information Processing and Resources on the Internet", National Univ. of Singapore, Singapore, 17-18 May 1997)
Introduction
Dravidian languages such as Tamil are written in non-Roman scripts. Typing text in these Indic languages on a computer requires specific font faces, word-processing software, or both. In this paper, the features of some of the most commonly used Tamil font faces and software packages are reviewed, and a possible scheme towards standardisation of Tamil computing is indicated. The term 'Tamil computing' is used here in a narrow sense to cover word-processing of Tamil-related materials on computers. In its full sense, Tamil computing covers a much broader domain with applications in many areas: tools for large databases of different kinds using Tamil script, multimedia kits involving Tamil, multilingual dictionaries, translation software, etc.
In the last two decades, many different font faces and desktop publishing (DTP) packages have appeared for word-processing of Tamil, and along with them different typing (input) methods. Some of these are based on a simple recasting of the Tamil typewriter keyboard in the form of 7-bit fonts. Others are sophisticated 8-bit font/word-processing packages where the actual keystrokes and their relative sequence are interpreted to provide the required Tamil text. These packages allow different modes of input, including romanized/transliterated input. Font encoding, i.e., the exact location of the different Tamil characters in the standard or extended ASCII table (128 or 256 slots) of the Tamil font being used, determines the 'output' content of the Tamil text irrespective of the mode of 'input'. Tamil text files created using one font/DTP package cannot be read using another font unless the font encoding scheme is identical between the two fonts in question.
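The dependence of stored text on the font encoding can be illustrated with a minimal Python sketch. The two encoding tables below (FONT_A and FONT_B) and their slot numbers are invented for illustration and do not describe any real Tamil font; the point is only that the same stored bytes yield different letters under a different encoding.

```python
# Two hypothetical 8-bit font encodings: slot number (byte value) -> Tamil letter.
# The slot assignments are invented for illustration only.
FONT_A = {0xA1: "\u0b95", 0xA2: "\u0bae", 0xA3: "\u0baa"}  # ka, ma, pa
FONT_B = {0xA1: "\u0baa", 0xA2: "\u0b95", 0xA3: "\u0bae"}  # pa, ka, ma


def encode(letters, font):
    """Store a sequence of letters as the slot numbers they occupy in `font`."""
    slot_of = {letter: slot for slot, letter in font.items()}
    return bytes(slot_of[letter] for letter in letters)


def render(data, font):
    """Show stored bytes using `font`; slots unknown to the font appear as '?'."""
    return "".join(font.get(byte, "?") for byte in data)


stored = encode("\u0b95\u0bae", FONT_A)  # the letters ka, ma typed with font A
print(render(stored, FONT_A))  # read back with the same font
print(render(stored, FONT_B))  # read back with a different font
```

Read back with the font it was typed in, the file shows the original letters; read with the other font, the same two bytes show two different letters, which is the situation described above.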
A growing number of Tamil pages are being put on the Internet/WWW using fonts and packages with different font encoding schemes. So we are now in an unpleasant situation: one needs to acquire and install almost as many fonts as there are Tamil web pages and archives on the Internet. The need for standards also arises from the growing trend to exchange and share information between individuals in different parts of the world. In the absence of standard protocols for how information is stored at the font-encoding level, information exchange on a world-wide scale becomes too complex, if not impossible, for many of the individuals concerned. The majority of end-users (the Tamil community at large) are not well-versed in the technical aspects of data storage and transfer. So procedures have to be designed so that ordinary people can put up web pages and share information electronically in Tamil world-wide without getting too involved in the technical details. Any proposal for standardisation needs to accommodate current typing habits and preferences (some kind of backward compatibility).
Transliterated/Romanized form of Tamil
By transliterated/romanized Tamil text, we mean Tamil text reproduced in a close phonetic form using Roman letters. Thus, the Tamil word for father is written as 'appA' (or 'appaa') and that for mother as 'ammA' (or 'ammaa'). Transliteration of Dravidian language materials has been popular amongst Western Indologists for well over a century, long before the era of modern computers. Standards were even discussed and adopted at an international conference as early as 1888.
The earliest and most widely used transliteration scheme is the Library of Congress (LC) scheme, which uses Roman letters with diacritics (horizontal bars or circles added above or below the letters) to represent the letters of Dravidian languages. Figure 1 shows this and the other transliteration schemes for Tamil discussed in this paper.
Fig.1 LC and plain ASCII type Transliteration schemes for Tamil
A diacritical mark added to a letter or symbol shows its pronunciation, accent, etc., typically indicating that its phonetic value differs from that of the unmarked letter. The scheme is very general in scope and hence can be used for all Indic languages. Established Tamil research centres around the world are aware of this scheme, and most of them implement it as such without modification. In Chennai, the Institute of Asian Studies (which publishes much of the research on Tamil literature) and the Roja Muthaiah Tamil Research Library, which has links to the Univ. of Chicago (and is involved in electronic cataloguing of a collection of 50000+ precious Tamil books), are examples of institutions that follow this scheme.
Given the practical constraints of present-day electronic communication (largely 7-bit), alternative transliteration schemes based on plain ASCII characters have also been in wide use. Figure 1 also includes some of the commonly used schemes of this kind. A plain ASCII type scheme was considered in the early pre-computer era but was abandoned as impractical. In the last two decades, with the growing use of computers, an increasing number of individuals and institutions have come to employ some form of transliteration scheme based on plain ASCII Roman characters. Presently most postings on USENET newsgroups such as soc.culture.tamil quote Tamil text in romanized form, for display on plain ASCII terminals. The MADURAI software uses a code to construct Tamil letters on screen in four lines using ASCII characters. Though the result is not of "print quality", it allows the message to be conveyed in a quasi-Tamil script. The classic 10-volume reference work "Tamil Lexicon", published by the Univ. of Madras during 1929-1939, used the transliteration scheme based on plain ASCII. The Institute of Indology and Tamil Studies (IITS) of the Univ. of Cologne (Köln) uses this scheme for cataloguing its collection of 50000+ Tamil books and also for its extensive collection of electronic texts of ancient Tamil classics (e.g., Sangam literature).
As said earlier, writing LC-style transliterated Tamil on computers requires special fonts that contain Roman letters with the diacritics. The Library of Congress and major Tamil libraries in the USA and Europe allow on-line searches of their catalogues from anywhere in the world. So that searches can be made from simple ('dumb') terminals, the on-line catalogues accept plain ASCII characters without the corresponding diacritical marks. Thus, one has to use the keyword 'anil' for squirrel while searching the catalogue of LC or of the Univ. of California, Berkeley. But at the IITS library of the Univ. of Köln, where the indexing follows the alternative transliteration scheme (based on plain ASCII), the search term would be 'aNiL'! Thus we have an anomalous situation where care has been taken to catalogue books using a special font (not readily available), but all its features are lost when searching with plain ASCII characters. There is also the practical problem that one first has to find out which transliteration scheme is used at the place of search.
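The loss of information in diacritic-free searching can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name strip_diacritics is hypothetical, the sample words are written in a diacritic style of the LC kind, and nothing here describes how any actual library catalogue is implemented.

```python
import unicodedata


def strip_diacritics(text):
    """Reduce Roman letters with diacritics to their plain base letters."""
    decomposed = unicodedata.normalize("NFD", text)  # split letter + combining mark
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The retroflex 'n' (n with a dot below) collapses to a plain 'n'.
print(strip_diacritics("a\u1e47il"))

# Two different words (retroflex versus plain 'l') become indistinguishable.
print(strip_diacritics("pa\u1e37\u1e37i") == strip_diacritics("palli"))
```

Once the marks are stripped, entries that differ only in their diacritics fall under the same search key, which is why the care taken in cataloguing is lost at search time.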
In view of the above points, it is essential that some consensus be reached on a universally adopted transliteration scheme. As will be discussed below, there are now DTP packages that allow 'input' in romanized text format. Here too it would be better if a standard transliteration scheme were universally adopted. Our preference is for a scheme such as that used in Adhawin/Madurai, one that allows writing in a near phonetically equivalent form using only plain ASCII characters.
Word Processing using 7-bit tamil fonts (direct output)
Since Tamil typewriters were in use for many years before the advent of computers, it is logical that early approaches to Tamil computing involved implementing the classical typewriter in the form of 7-bit fonts. The various Tamil characters are placed under the Roman letters at the locations equivalent to those on the Tamil typewriter. All of the Tamil letters are obtained using the normal and shift modes of the standard keyboard. While some letters are obtained with a single keystroke, others require two or three keystrokes. With such fonts, those accustomed to typing on a Tamil typewriter can make the transition to Tamil computing without difficulty or loss of typing speed. This trend is very strong in Tamilnadu even today. The majority of Tamil computing uses the Tamil typewriter keyboard layout(s). So any effort to standardise Tamil computing needs to take this reality into account. Many font faces of this type are available: TAMILLASER of Prof. George Hart, ANANKU of P. Kuppuswamy (widely used in the continental US) and SARASWATHI of Vijayakumar (widely used in Canada) are some examples. The BHARATHI word processor for plain DOS computers was one of the earliest to appear (in the early eighties) in the Malaysia and Singapore region. VENUS is a recent, updated version of this word processor that runs under Windows.
The common logic in any keyboard layout design is to place the most frequently occurring letters in the central part of the keyboard (and move the less frequent ones to the left and right extremes). This logic was applied quite a while ago in the design of typewriters. For Tamil, the classical typewriter layout chose one particular assignment:
ya, La, na, ka, pa, the modifier for aa, tha, ma and ta in the middle line; nga, Ra, n^a, ca, va, Na, ra, sa, zha and the modifier for i in the top line; and ii, la, o, u, e, ti, the modifier for e, a and i in the bottom line.
There have been several re-examinations of this concept of character placement for the Tamil keyboard recently. Mohan Tambe of CDAC, Pune designed a keyboard layout for Indic languages using such an analysis. Naa. Govindasamy (host of this conference) has made a similar analysis for Tamil and has designed the Kanian/IE/Singapore Tamil keyboard layout.
An alternative to the Tamil typewriter keyboard layout links the Tamil characters phonetically to the corresponding Roman letters. Thus you hit the key k to get ka, m for ma, l for la, p for pa, k followed by i for ki, k followed by I for kii, and so on. For those who have never used a Tamil typewriter, this approach can be intuitive and very appealing. Since the characters of a 7-bit font are readily accessible via the normal and shift modes of the keyboard on all computers, I designed a phonetically based 7-bit font called MYLAI. The term 'phonetic' is used in a slightly different sense by many (e.g., participants of this conference Naa. Govindasaamy and Ravindran Paul). So we use the abbreviation WYTIWYG (what you type is what you get) to refer to keyboard layouts based on the phonetic input method described above. The frequency of occurrence of characters in Tamil need not necessarily be the same as that of the corresponding letters in English. So I had some reservations about whether people would sustain an interest in keyboard layouts of the WYTIWYG kind. To my pleasant surprise, the reception of the Mylai keyboard has been overwhelming. In the last three years, several thousand Tamil lovers around the world have received a copy of the Mylai font and are happily using it for Tamil computing. Some even wrote to say that, being satisfied with Mylai, they have deleted Tamil font faces of the classical typewriter kind that they had earlier paid for. I should state here that Mylai was not the first Tamil font available free on the Internet (several others have been freely available), nor does it give the most aesthetically pleasing printout for very demanding end-users.
Fig.2 Mylai phonetic/wytiwyg keyboard Layout
Word-processing using 8-bit font faces (direct output)
If one counts the letters of Tamil, there are over 230 (13 vowels, 18 consonants and the compounds (uyirmeis) derived from these). Tamil is one of the Indian languages where many of the compound (uyirmei) letters have a complex geometric structure (glyph) of their own. In 7-bit fonts with 128 slots, nearly half of the slots are not available for placing Tamil characters (the first 32 slots are reserved for control characters, 10 for the numerals and another 10 or 12 for various key punctuation marks). For the number of Tamil letters to be handled, the remaining positions are rather limited. In 7-bit fonts, a number of compound (uyirmei) letters are obtained simply by adding a modifier glyph to the parent consonant. The Tamil typewriter uses this concept extensively. 'Kerning' is a technique that allows controlled fusion of two successive characters. Unfortunately, kerning is not easily implemented on many computer platforms. Without kerning, the quality of on-screen display and of print using such 7-bit fonts can be far from satisfactory, at least for commercial publishing houses. So there have been efforts to move to fonts of the 8-bit type (256 slots available). 7-bit and 8-bit fonts have their own merits and demerits. We will return to this topic later on.
In the absence of kerning and other character control features, many of the software packages designed for publishing houses include the Tamil uyirmeis with complex structural forms as such in the upper ASCII part (128-255). In this way print of good aesthetic quality can be ensured.
In the Macintosh OS, it is easy to access many of the characters in the upper ASCII part using the 'option' and 'shift-option' keys. T. Govindaraj (of USA) designed an 8-bit Tamil font for the Mac called PALLADAM that makes use of this feature. In this font, the Tamil letters ma, mu and muu, for example, are obtained using the keys m, shift-m and option-m respectively. In Windows, one needs to hold the 'alt' key down and type the three-digit reference number of the character in question preceded by a zero, as in 0172 or 0213. One needs to remember these numbers to be able to type at a reasonable speed. So keyboard editors/managers are often used. With these keyboard editors, one can access any character using any key, irrespective of the font encoding scheme used.
RAMINGTON TAMIL is an example of the 8-bit extension of the classical Tamil typewriter keyboard. In addition to the 26 slots occupied by numerals (10), punctuation marks (11) and mathematical operators (5), 78 Tamil characters are placed in the font face. On Windows-based PCs, the alt key is used to obtain the extra Tamil characters. Softview Computers of Chennai markets Tamil word processors that work with the Ramington Tamil keyboard. Font faces with this keyboard layout are used extensively by the publishers of Tamil newspapers and magazines in Chennai.
Fig.3 Ramington Tamil Keyboard Layout
Word Processors based on romanized input (interpreted output)
ADAMI was one of the early Tamil word processors for MS-DOS PCs, produced by Dr. K. Srinivasan of Canada in the early eighties (released in 1984 for CP/M-80 computers) to recast transliterated text into Tamil. The Tamil text is typed using a plain ASCII transliteration scheme. Upon execution of the linked macro, the romanized text page is recast on screen in the equivalent Tamil. One needs to return to the romanized text mode to make any corrections. In a more recent version of this software called THIRU, the author provides a split screen, where the Roman text being typed in the bottom half of the screen is continuously recast in Tamil in the upper half. ADHAWIN is another recent implementation of the same software, for Windows-based PCs. The transliteration scheme used in MADURAI is a subset of that used in ADAMI/ADHAWIN. Software that operates in this way belongs to a general class called "romanized input/interpreted output" packages.
For those who have never written extensively in Tamil (and for beginners who are not sure of the exact uyirmei to use in writing Tamil words, e.g. na/Na), word processors that allow transliterated input are attractive. The Adami, Madurai, ITrans and XLibTamil packages belong to this category. The last three are freeware and are popular amongst the UNIX user community. They are widely used to make Tamil-related postings in USENET newsgroups. Used in conjunction with the corresponding meta-fonts and TeX-type word-processing extensions, they give high quality printouts of Tamil texts.
The MURASU and ANJAL word-processing packages, widely used by Tamil newspapers and magazines in Malaysia and Singapore, are the products of Muthu Nedumaran, who is present at this conference. These packages belong to the group of "romanized input/interpreted output" tools. The inaimathi and related font faces used in these packages are of the 8-bit bilingual type. The first 128 slots (0-127) are filled by Roman characters as in basic ASCII, and the Tamil characters occupy the upper ASCII slots (128-255). By invoking the keyboard editor it is possible to access either of these two blocks. In the Tamil typing mode, the Roman keystrokes and their relative sequence are continuously interpreted to present the equivalent Tamil characters on screen. Thus you type 'kathai' to get the equivalent Tamil word.
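The interpretation of romanized keystrokes can be sketched in Python as a greedy longest-match scan over the typed text. This is a minimal sketch under stated assumptions: the small tables below are a hypothetical plain ASCII scheme made up for illustration (they are not the tables of Murasu, Anjal, Adami or any other package), and the output is written as present-day Unicode code points rather than slots of an 8-bit font.

```python
# Hypothetical romanization tables, for illustration only.
CONSONANTS = {
    "k": "\u0b95",   # ka
    "th": "\u0ba4",  # tha
    "p": "\u0baa",   # pa
    "m": "\u0bae",   # ma
    "l": "\u0bb2",   # la
}
# Each vowel maps to (independent letter, sign attached to a preceding consonant).
VOWELS = {
    "a": ("\u0b85", ""),          # inherent vowel: no sign needed
    "aa": ("\u0b86", "\u0bbe"),
    "i": ("\u0b87", "\u0bbf"),
    "ii": ("\u0b88", "\u0bc0"),
    "u": ("\u0b89", "\u0bc1"),
    "uu": ("\u0b8a", "\u0bc2"),
    "ai": ("\u0b90", "\u0bc8"),
}
PULLI = "\u0bcd"  # dot marking a consonant with no vowel


def _match(text, i, table):
    """Return the longest key of `table` (2 letters, then 1) found at position i."""
    for length in (2, 1):
        key = text[i:i + length]
        if key in table:
            return key
    return None


def romanized_to_tamil(text):
    """Interpret a romanized string as Tamil, one syllable at a time."""
    out = []
    i = 0
    while i < len(text):
        cons = _match(text, i, CONSONANTS)
        if cons:
            i += len(cons)
            vowel = _match(text, i, VOWELS)
            if vowel:
                out.append(CONSONANTS[cons] + VOWELS[vowel][1])
                i += len(vowel)
            else:
                out.append(CONSONANTS[cons] + PULLI)
            continue
        vowel = _match(text, i, VOWELS)
        if vowel:
            out.append(VOWELS[vowel][0])
            i += len(vowel)
        else:
            out.append(text[i])  # pass through anything not in the tables
            i += 1
    return "".join(out)


print(romanized_to_tamil("kathai"))
print(romanized_to_tamil("ammaa"))
```

Trying two-letter keys before one-letter keys is what lets 'th' and 'ai' be read as single units in 'kathai'. A real package would need complete tables and rules for ambiguous sequences, which this sketch does not attempt.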
Word Processors based on phonetic keyboard input (interpreted output)
Intelligent Tamil word processors are now available in which the large number of uyirmei letters are obtained by sequential keying of the corresponding mei and uyir characters. Thus the keystroke for the consonant k followed by that for the vowel i leads to the appearance of the compound character ki. Keyboard layouts of this kind have been called "phonetic". There are no separate characters for kokkis, kombus, etc. The keyboard driver does the mapping and remapping based on the sequence of keypress events. An advantage of this approach is that the number of keys needed to get all the uyirmeis is considerably smaller. Mohan Tambe (formerly of the Centre for Development of Advanced Computing, CDAC) was one of the early pioneers working on keyboard layouts appropriate for Indian languages. His phonetic keyboard layout, known as INSCRIPT, was initially designed (in 1983) for Devanagari script input. It has been adopted for use in the multilingual word processors CDAC developed for Indian languages (cf. references to CDAC and Inscript in the next section).
The THUNAIVAN word processor of Ravindran Paul, the IE PHONETIC KEYBOARD LAYOUT of Naa. Govindaswamy and the CHARACTER PHONETIC DEPENDENCIES/YARZAN keyboard editor of R. Shanmugalingam are different implementations of this phonetically based keyboard input concept. Compared to wytiwyg keyboard layouts, phonetic layouts considerably reduce the number of keystrokes required to get the o-kara, oo-kara and ou-kara uyirmeis. Here you type k followed by o or O: two keystrokes give three characters.
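The keystroke saving for the o-kara uyirmeis can be made concrete with a short Python sketch. The key assignments (k, o, O) follow the example in the text, but the tables and the function names type_keys and glyph_parts are our own hypothetical choices. As a stand-in for the separate glyph pieces of a font, the sketch uses the canonical decomposition of present-day Unicode, which is an assumption and not how the packages named above work internally.

```python
import unicodedata

# Hypothetical phonetic key tables, for illustration only.
KEY_TO_CONSONANT = {"k": "\u0b95"}                # ka
KEY_TO_VOWEL_SIGN = {"o": "\u0bca", "O": "\u0bcb"}  # short o, long oo


def type_keys(keys):
    """Map a key sequence to text; a vowel key applies only after a consonant."""
    out = []
    for key in keys:
        if key in KEY_TO_CONSONANT:
            out.append(KEY_TO_CONSONANT[key])
        elif key in KEY_TO_VOWEL_SIGN and out and out[-1] in KEY_TO_CONSONANT.values():
            out.append(KEY_TO_VOWEL_SIGN[key])
        else:
            out.append(key)  # leave unknown keys unchanged
    return "".join(out)


def glyph_parts(text):
    """Split text into base letters and the individual modifier pieces."""
    return list(unicodedata.normalize("NFD", text))


typed = type_keys("ko")
print(typed)
print(len("ko"), "keystrokes,", len(glyph_parts(typed)), "pieces")
```

The decomposition lists the consonant first, followed by the two modifier pieces. On screen or in print, one of those pieces is drawn to the left of the consonant and the other to its right, an ordering that a typewriter-style layout leaves to the typist.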
Place of Tamil in Multilingual word processing packages
Thanks to advances in the design of faster memory chips and compact high-capacity storage devices, computers with gigabytes of hard disc storage, megabytes of RAM and high speed (>200 MHz) are already available at very affordable prices to the general public (at least in the Western world, if not in India). To make full use of this capability, multilingual packages are being developed that allow preparation of documents containing the scripts of more than two or three languages. Several thousand characters corresponding to ten or more languages are bundled in a single 'super-font', and appropriate software allows one of these languages to be selected from a pull-down menu. The viability of this approach has already been demonstrated in the multi-language kit covering all the European languages, Greek and Turkish for the Windows 95/NT environment. Microsoft currently distributes 'free' a font face containing 800+ characters along with software to use with it. Indian languages are not yet included in this multi-language kit of Microsoft. MtScript (developed by the Univ. of Aix-en-Provence with the support of the French CNRS) is a multilingual text editor (for UNIX systems running Solaris) that enables the use of several different writing systems (Latin, Arabic, Cyrillic, Greek, Hebrew, Chinese, Japanese, Korean, etc.) in the same document. All of the languages defined in the ISO 8859-X schemes are supported in this package.
The Unicode Consortium is currently working on a standard character set for the world's languages (ISO 10646) for future use in multilingual word-processing. The Unicode 2.0 version currently under discussion proposes specific slot assignments for the characters of world languages, including all Indic scripts (Devanagari, Gurmukhi, Tamil, Telugu, Malayalam, Kannada, ...). Muthu Nedumaran's paper at this conference dwells on the details of implementing Unicode. So we will not go into detail here, except to make a few remarks on the implications of the proposed character set (font encoding scheme). It was mentioned earlier that 8-bit fonts allow a large number of Tamil letters (100 or more) to be stored in their native form, and that this in turn allows the high quality production of printed Tamil texts required for commercial publications. The Unicode character set for Tamil has the bare minimum (64): vowels, consonants, Tamil numerals and a handful of modifiers to add to the consonants to get the compound (uyirmei) characters. None of the uyirmeis has been allocated a slot. If one uses only this minimal character set, many of the uyirmeis have to be written in new forms (e.g., pu, mu, puu and muu written using the same right modifiers that are added to the grantha letters ha/sa to get hu/su or huu/suu). Writing many of the uyirmeis in this new form (and deleting all the currently used structural forms/glyphs) amounts, in essence, to introducing drastic language reforms: reforms in the way the script of the language is currently written.
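How a minimal character set of this kind covers the uyirmeis can be sketched in Python. The sketch rests on assumptions of our own: it uses the Tamil code points of present-day Unicode as they appear in Python strings (taken here to correspond to the proposal discussed above), it counts the conventional 12 vowels without the aytham, and it leaves out the grantha letters. It says nothing about which glyph shapes a font draws for the resulting sequences, which is the concern raised above.

```python
# The 18 Tamil consonants, given as Unicode code points.
CONSONANTS = [chr(cp) for cp in (
    0x0B95, 0x0B99, 0x0B9A, 0x0B9E, 0x0B9F, 0x0BA3,
    0x0BA4, 0x0BA8, 0x0BAA, 0x0BAE, 0x0BAF, 0x0BB0,
    0x0BB2, 0x0BB5, 0x0BB4, 0x0BB3, 0x0BB1, 0x0BA9,
)]

# The inherent vowel 'a' needs no sign; the other 11 vowels each have one.
VOWEL_SIGNS = [""] + [chr(cp) for cp in (
    0x0BBE, 0x0BBF, 0x0BC0, 0x0BC1, 0x0BC2, 0x0BC6,
    0x0BC7, 0x0BC8, 0x0BCA, 0x0BCB, 0x0BCC,
)]

# Every uyirmei is stored as a sequence: consonant followed by a vowel sign.
uyirmeis = [consonant + sign for consonant in CONSONANTS for sign in VOWEL_SIGNS]

stored_characters = len(CONSONANTS) + len(VOWEL_SIGNS) - 1  # the empty sign is not stored
print(len(uyirmeis), "uyirmeis from", stored_characters, "stored characters")
```

The storage side is therefore compact: a few dozen slots suffice for the compounds. Whether pu, mu and similar letters then appear in their customary fused shapes depends entirely on the font and the rendering software, not on the character set.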
In a parallel development to Unicode, the Dept. of Electronics (DOE) of the Govt. of India has been developing standards for computing in Indian languages (including Tamil) for over a decade. The primary tool is Graphic and Intelligence based Script Technology (GIST), a phonetic-based computing technology. The Centre for Development of Advanced Computing (CDAC), based in Pune, is the organization in India engaged in developing multilingual computing tools based on the GIST technology. Mohan Tambe (working initially at IIT, Kanpur, and later as the Head of the GIST group at CDAC, Pune) is the brain behind the major multilingual computing projects for Indian languages in India. The 1986 proposals of the DOE for possible font encoding standards were revised by the Govt. of India in 1988 and were adopted as the 'national standard' under the name "Indian Standard Code for Information Interchange" (ISCII-88). The early version of Unicode was apparently modelled on the ISCII-88 standard. As in the Unicode scheme, the basic characters defined in the ISCII character set are graphic characters such as (in Hindi) the Anuswar and Visarg, a set of vowels, a set of consonants and the vowel signs. The display rendering and the formation of conjuncts are left to software meant for that purpose. Along with the ISCII standard for font encoding, the "phonetic keyboard layout" of Mohan Tambe has been adopted under the name INSCRIPT as the national standard for keyboard layout. The GIST technology works in the 8-bit mode, where the Tamil (or other Indian language) characters are placed in the upper ASCII slots 160-255 (actually 79 characters/glyphs). The entire lower half and the line-drawing character set in the upper half are left undisturbed for English, so that bilingual documents in English and the Indian language can be readily prepared. CDAC markets several products for multilingual computing based on this GIST technology.
The phonetic/Inscript keyboard designed by Mohan Tambe is used in all of the CDAC/GIST packages. Apex Language Processor (ALP), ISM (ISFOC Script Manager) and LEAP (Language Environment for Aesthetic Publishing) are some of the multilingual word processors sold by CDAC directly or through its franchises. The popular word-processing package SHREE LIPI of Modular Systems is another commercial version of the package. LEAP is a multiscript word-processing package for Windows (like MS-WORD) that allows comparing texts in all Indian languages. It is a cost-effective solution for marketing/advertising agencies, where trade literature giving details of products can be presented in all Indian languages one after the other.
Apple has very recently released, for Macintosh computers, a first version of its 'INDIAN LANGUAGE KIT (ILK)'. This package contains fonts and software for word-processing in Devanagari and Gurmukhi. It has been stated that the ILK package is modelled on the ISCII standards of the Govt. of India. INSCRIPT is the generic name given to the keyboard layout specifically designed for input of Indic languages. ComStar of Cupertino, California, USA markets multilingual word processors called Gamma UniType and UNIVERSAL WORD FOR WINDOWS that allow preparation of multilingual text and support a large number of world languages, including Tamil. WordMate (also of ComStar, Inc) is a versatile multilingual software/keyboard driver that enables the user to type any of a long list of languages directly into virtually any Windows application.
The Multilingual directory of the Internet lists the following software currently available for multilingual word-processing including Tamil: Allwrite (of ILECC), Chitralekha (of Modular Systems), Apex Language Processor (ALP) and ISM (ISFOC Script Manager, of CDAC, Pune), Amicus (of Amicus), Gamma UniType and Multilingual Scholar (of Gamma Productions), Kalam (of Solustan Inc), LEAP (Language Environment for Aesthetic Publishing, of CDAC, Pune), Prakashak (of Sonata), Swadesh (of Institute for Typographical Research), Vision Publisher (of Vision Labs).
Proposals for standardisation/font encoding for Tamil should take into account the mode of functioning of these multilingual word processors. It would be unwise and impractical to have different world standards for Tamil: one for mono/bilingual usage within the 8859-X scheme and another for multilingual packages.