Webpages of Tamil Electronic Library © K. Kalyanasundaram
PROJECT MADURAI : An online Resource
(Madurai Thamizh Ilakkiya Minthokupputh Thittam; URL: http://www.tamil.net/projectmadurai)
Paper presented at the TAMILNET99 International Tamil Computing Conference held in Chennai during 7th and 8th Feb 1999.
Project Madurai is an open, voluntary initiative devoted to electronic texts (Etexts) of Tamil literary works. This paper gives a broad overview of how the project works, the Etexts released in its first year and the Etexts under preparation. Looking to the future, the paper also raises some key issues relating to Tamil Etext archives that remain to be sorted out: copyright of very ancient works, standardization of file formats and distribution modes, a dedicated OCR for Tamil and an international registry for Tamil Etexts.
Electronic versions of printed texts (abbreviated as Etexts) of literary
works are important pedagogic and scholarly resources. Stored in easily
accessible archives, they permit preservation and wider distribution of
literary works around the globe through the Internet. A digital library
of electronic texts is an elegant means by which millions of Tamils and
Tamil lovers worldwide can access Tamil literary classics right from their homes
and hence maintain close contact with their culture.
Tamil Etexts and Etext Archives
One can cite several reasons why the Tamil language badly needs an extensive public-access Etext archive. As an ancient language dating back at least two thousand years, it has a very rich repertoire of literary contributions. Unfortunately, a good part of these works remained at the palm-leaf manuscript level and were never published in printed form. 20th century Tamil literature is largely dominated by novels, with an associated decrease in the sale of printed copies of ancient literary works covering other domains. With a dwindling market, key publishing houses such as Saiva Siddhantha Trust are dropping plans to reprint classics. A good number of old works have not been reprinted for several decades and hence are out of print.
Comprehensive collections of the Tamil works ever published are now restricted to a handful of libraries both within and outside India. Due to the nature of the printing paper used in olden days and poor storage conditions, many of the published books are being eaten away by insects. The loss of the entire book collection of the major Tamil library of Jaffna in a fire is a stark reminder of the major risk that Tamil faces of losing its rich literary treasures when they are confined to a few places. The following scrolling logo that we use in Project Madurai summarizes succinctly the benefits of an Etext archive: "kAlatAl aziyAtatu, koTutAl kuraiyAtatu, centaNalAl vEkAtatu mintokuppu. tamizh ilakkiyangkalai nanrE kAttu, ulakengum parappa inrE cEruviir maturai tamiz ilakkiya mintokuppu tittam". We all have a moral obligation to preserve our rich heritage and ensure that future generations have access to these literary works, possibly via better means of archival and worldwide distribution.
Dr. Thomas Malten of the Univ. of Cologne in Germany was possibly the first to engage in the preparation of Etexts of Tamil literary works. Over a decade, he and his colleagues at the Institute of Indology and Tamil Studies prepared Etexts of practically all major works of the Sangam period. Before the advent of modern Tamil computing directly in Tamil script, transliteration was the mode of writing on computers, and the Etext archives of Cologne are also in this transliterated form. The Etexts themselves have not been made available to the general public (to date). However, WAIS-based word search of the archived texts has been available through gopher and Web servers for many years. In collaboration with Prof. George Hart of UC Berkeley, Dr. Malten is currently working on a major project called "PONGAL 2000" through which their Etext collections will be made available (hopefully in Tamil script). The preparation of the Etext of Naalayira divya prabhandam (in transliterated format) in the early nineties by a group of volunteers led by Prof. P. Dileepan can also be cited as one of the early successful examples of Internet-based voluntary efforts in this domain.
With the availability of free Tamil fonts on the Internet starting from the mid-nineties, there has been a phenomenal increase in the amount of Tamil-related material on the Internet. Through my Tamil Electronic Library web site, launched in early 1995, I have been distributing the Mylai font free for use on Windows, Mac and Unix platforms. Along with the font, I distribute Etext files of small literary works, particularly of the devotional type, such as thirukural, kandar shasti kavacham, auvaiyar works, select collections of thevaram, etc. Through postings in soc.culture.tamil (USENET) and tamil.net (Email discussion list), many expressed the desire to start a collective effort targeting ancient literary classics. Based on recent calls in tamil.net, this initiative officially took off on Pongal day, Jan. 14, 1998! After some collective discussions, "Project Madurai" (PM) was adopted as the name for this Internet-based voluntary initiative. (The abbreviation PM is used in the rest of this paper to refer to Project Madurai.) Madurai has been a citadel of Tamil culture. It was the Pandyan kings who, during their long reign, set up Sangams (academies) for the encouragement and criticism of Tamil studies. Hence it is appropriate that an initiative devoted to preserving and distributing Tamil literary works in electronic form is named after this historic city.
Modus Operandi of Project Madurai
Project Madurai is based on voluntary cooperation between Tamil lovers living in several countries. It is planned to use both direct text input and OCR of scanned images to generate Etexts. In the absence of a dedicated OCR, Etexts are currently prepared through direct input by the PM volunteers. (Univ of Chicago recently opted to microfilm the entire 50000+ book collection of the Raja Muthiah Library of Chennai. Using a dedicated OCR system, it should be possible to generate Etexts from such microfilms in a short span of time!) The Etexts are subsequently proof-read by another PM volunteer before they are archived and released to the general public. Completed Etexts are made available through a dedicated web site << www.tamil.net/projectmadurai/ >> hosted by Asia Pacific Internet Company (APIC) of Sydney, Australia. The Etexts are distributed in plain text format and also as formatted texts in the form of Web pages (HTML version). Portable Document Format (PDF) is becoming increasingly popular as a mode of distribution of formatted texts on the Internet. Free Acrobat readers are available to read PDF files on all commonly used computer platforms. Distribution of Tamil Etexts in PDF format has just begun for select works. When the Etext collections become substantial, it is planned to make them available on CD-ROMs.
As a grass-roots, Internet-based effort devoted to the free distribution of Tamil Etexts, PM belongs to anyone committed to the stated goals of the Project. All coordination and exchange of files take place through standard Email. A simple Email connection suffices for anyone to take part in PM activities; there is no need for a full Internet connection. Operating rules are set by the volunteers themselves. Project execution and evolution are discussed routinely through a dedicated Email discussion list <
Considerable flexibility is given to the volunteers in the choice of the work they key in and of the font encoding format. The goal is to let each volunteer work in an environment (font and computer) he/she feels comfortable with and on a work in which there is some personal interest. Very minimal constraints, if any, will be imposed on the volunteers, who do the major task of keying in the texts. Appropriate acknowledgement is given to the volunteers involved, both in the web pages announcing the availability of a given Etext and in the source files. In addition to the preparation of Etexts, PM volunteers are also involved in related activities such as the development of a dedicated OCR for Tamil and of software for inter-conversion between font encoding formats. Through these converters it is possible to provide Etexts in at least five different font encoding formats (typewriter, romanized/transliterated, Adhawin, Inaimathi/Anjal and Mylai). This makes the Etexts accessible to a much larger Internet user community.
Font Encoding for Etexts
In Project Madurai, we are committed to using an internationally recognized font encoding standard for the archives. In the first year, we chose to work with the following two font formats for primary file archiving: Inaimathi/Anjal and Mylai. The reasons for limiting the choice to these formats were the following: a) over 90% of Tamils worldwide use one of these; b) fonts are available free for use on all three major computer platforms (Windows, Macintosh and UNIX), so anyone who uses these computers can work in either format; and c) converters such as Adhawin and Anjal are available that work reliably to go between these formats and to the several others mentioned above. So we can provide equivalent forms of the Etexts in typewriter and romanized/transliterated formats to interested parties. An important added component of each Etext is an introductory note (presented as a Web page) that sets the background for the work and its author. It is written in a simple style so as to be easily understood by native Tamils and also by language lovers/researchers.
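The idea of converting between glyph-based 8-bit font encodings can be sketched as a byte-for-byte table lookup. The example below is only an illustration under stated assumptions: the two tables and their code points are invented placeholders and are not the real Mylai or Inaimathi/Anjal assignments, and the function names are hypothetical. Real converters also have to handle glyphs that one encoding stores as a single byte and another as several, which this sketch ignores.

```python
# Hypothetical tables: these code points are invented for illustration and
# are NOT the actual Mylai or Inaimathi/Anjal byte assignments.
ENCODING_A_TO_GLYPH = {0xA1: "ka", 0xA2: "ma", 0xA3: "ta"}
GLYPH_TO_ENCODING_B = {"ka": 0xC1, "ma": 0xC5, "ta": 0xC9}


def build_table(src_to_glyph, glyph_to_dst):
    """Compose two glyph tables into a 256-entry byte translation table."""
    table = bytearray(range(256))  # unmapped bytes pass through unchanged
    for src_byte, glyph in src_to_glyph.items():
        if glyph in glyph_to_dst:
            table[src_byte] = glyph_to_dst[glyph]
    return bytes(table)


def convert(data: bytes, table: bytes) -> bytes:
    """Re-encode an Etext byte string using a prebuilt translation table."""
    return data.translate(table)


A_TO_B = build_table(ENCODING_A_TO_GLYPH, GLYPH_TO_ENCODING_B)
```

Because ASCII bytes are left untouched, punctuation, spaces and line breaks in a plain-text Etext would survive such a conversion unchanged.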
Selection of Works for Etext Archives
Project Madurai is deeply committed to respecting the copyright protection given to authors of literary works. Even though copyright rules vary from country to country, in most Etext archiving projects an elapse of at least 75 years after the life span of the author is considered a safe criterion before a work can be considered to be in the public domain. (In the USA, the Sonny Bono Act approved by Congress late in 1998 has extended the author protection to a total of 95 years.) So, as a rule of thumb, we can consider works of authors of the 19th century and earlier to be in the public domain. Selection of works for Etext preparation is primarily dictated by this date constraint; hence the tilt towards archiving ancient literary classics. Another reason for choosing ancient literature is that most of these works are out of print and stand the risk of being lost to the world.
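The date-based rule of thumb described above amounts to simple arithmetic, sketched below. This is an illustrative screening check only, not legal advice: the function name and parameters are hypothetical, the 75-year default is the figure quoted in this paper, and actual copyright terms depend on the country and on the edition being reproduced.

```python
def is_probably_public_domain(author_death_year: int,
                              current_year: int,
                              term_years: int = 75) -> bool:
    """Rule-of-thumb check: has the protection term elapsed since the author's death?

    term_years defaults to the 75-year figure used in this paper; pass 95 to
    apply the longer term mentioned for the USA. This is an assumption-laden
    screening aid, not a legal determination.
    """
    return current_year - author_death_year >= term_years
```

For example, for an author who died in 1898 the check passes in 1999 under both the 75-year and the 95-year term, whereas for an author who died in 1950 it fails under either.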
I. Works of popular appeal and devotional literature:
II. Great anthologies of:
III. Works of 20th Century Authors:
B. Recent Authors who give permission: 20th century literature is largely dominated by novels. One of our PM volunteers, Mr. Gandhi Kannadhasan, has been lobbying leading novelists, encouraging them to place at least one or two representative novels in the electronic archives. The following is a short list of modern authors who have expressed willingness to participate in this process of putting select works in Etext form: Kannadhasan (through legal heirs), Akilan, Naa Paarthasarathy (through legal heirs), Prabanjan, Indumathi, Rajam Krishnan, Indira Parthasarathi, Malan and Jeyakanthan.
IV. Translations and Commentaries of Literary Works:
V. Literary works of Authors of Sri Lankan Origin:
Worldwide Network of Volunteers
When the project was launched a year ago, we started with a small group of about 25 volunteers interested in getting involved. In one year the ring of volunteers has expanded to 80, and they come from all four corners of the world: U.S.A. [from the states of CA, NM, TX, KA, IL, IN, GA, OH, VI, NY, NJ and NH], Canada [Quebec and Regina], Western Europe [London (UK), Cork (I), Kiel (G), Lausanne (CH), Helsinki (Fi)], Gulf States [Riyadh, Abu Dhabi], India [from Chennai, Trichi, Katpadi, Coimbatore, Bangalore, Bombay, Delhi, Baroda, Kanpur], Sri Lanka [Colombo], Southeast Asia [Singapore, Malaysia] and Australasia [from the cities of Sydney, Wellington and Auckland].
Etexts Completed as of Date
The following is a list of works for which Etext preparation is nearly complete (those marked with * have already been released to the public): thirukuRaL*, auvaiyar works*, thiruvaachagam*, thirumanthiram*, naalayira divya prabhandam*, thaNNir dEsam* of Vairamuthu, Bharathiyar songs-part I*, dEsika prabhandam* of vEdAntha dEsikar, Bharathiyar Tamil translation and commentary* of Bhagavad Gita, naLaveNbA* of pukazhendip pulavar, nAladiyAr*, 9th thirumurai (thiruisappas), 11th thirumurai covering works of kAraikAl ammaiyar, nakkiirar, kapilar, nambi AndAr nambi, English translation of ThirukuRaL by Yogi Suddhanantha Bharathi*, anthology of contemporary literature of Tamil immigrants (particularly those of Sri Lankan origin), Holy Bible - New Testament in Tamil (part I)*, thiruvarutpa of rAmalinga adigaL and pathiRRupaththu.
Etexts in Preparation
The following is a short list of Tamil works that are currently under preparation: paadalkaL of ciddhars pattinathAr, pathirakiri and others, works of Bharathidaasan, thiruppuhaz of aruNagiri, ciRappurANam of umarup pulavar, English translation of thiruvAcakam by Rev GU Pope, Tamil commentary of thirukuRaL by parimElazhakar, nAttuppurap pAdalkaL /Tamil folk songs, kamba rAmAyaNam, Tamil translation of Kalevala (great Finnish Epic) by UthayaNan, Modern Tamil Writings of Sri Lankan Authors in Exile, Bharatha Shakti Maha Kaviyam and other works of Yogi Suddhanantha Bharathi, select works of Kavinjar Kannadhasan, Thampikki aNNAvin kadithangaL of CN Annathurai, cilappathikaaram,...
General Issues Related to Tamil Etext Archives
As part of this presentation, I would like to discuss some important issues related to Tamil Etext archives, hoping to find some response from the participants of this conference and also from the Tamilnadu Govt. The first issue is the question of copyright: which of the ancient works are in the public domain? It was mentioned earlier that the Berne Convention on copyright, signed by many countries (including India), confers protection on authors (and their legal heirs) for 75 years after the passing away of the author. In the USA, the Sonny Bono Act has extended this limit to 95 years. By this measure, works of authors who left us by the beginning of the 20th century are in the public domain. In Tamil, there have been contradictory reports that even works of the Sangam period are still under copyright. Promoters of Etext archives are very much concerned by claims of this kind and have delayed public release of Etexts for several years. Herein I would like to offer some suggestions, particularly regarding works that date to the 19th century or earlier. I must emphasize that these are purely my personal views and that I am not a legal expert able to make definitive statements.
A significant number of works dating back several centuries were never published in the printed book form that we are familiar with. Most of them remained at the palm-leaf manuscript level. Only in recent times have there been systematic efforts to publish these very ancient works of Tamil. Thamizh thaththA U.Ve. CuvAminAtha Aiyar was a pioneer in this area and has been responsible for several critically edited versions of Sangam period works. Clearly such critically edited, recent publications of ancient works are subject to copyright, even if the original work may date back several hundred or thousand years. Maiden publications of hitherto unpublished works (e.g. works that are still at the palm-leaf manuscript stage) are also subject to copyright held by the recent editor and publisher. The Institute of Asian Studies, for example, is an institution committed to publishing ancient works from palm leaves. Recent commentaries on ancient works are also subject to copyright, as are recent translations of ancient works into English and other languages. On the other hand, very ancient works such as thirukuRaL that have been repeatedly reprinted by several publishers during this century are possibly in the public domain. There should also be no constraint on someone extracting the source/moolam verses of ancient works from recent commentaries and publishing the original work alone in Etext form. The task of promoters of Etext archives would be greatly facilitated if the Tamilnadu Govt. appointed a Special Experts Committee to look into this question. Working along with the Book Publishers Association of Tamilnadu, the committee could make definitive statements on which of the ancient works are indeed in the public domain for anyone to freely reproduce, electronically or otherwise.
A second important issue is standardization of the file/data format in Etext archives. Several isolated efforts are currently underway, all aimed at archiving works electronically. In the absence of any clear standards, each archive is using its own font encoding and file storage formats. Institutions with an interest in linguistics, such as Univ of Pennsylvania, want "corpus" text in which every word of the Etext is tagged. In view of the growing popularity of the Web, formatted texts are being made available in HTML format. Librarians, on the other hand, prefer that such formatted texts be in SGML format. Efficient working of Internet-based search engines requires some standardization to be implemented in these archives.
The transliteration scheme is another area where standardization needs to be enforced. Etext archives of works in transliterated format are still of interest to researchers of non-Tamil origin. As for the formats for transliteration, the classical scheme using diacritical marks (dots and bars below and above) is used by the Library of Congress in its cataloguing and also by the Inst. of Asian Studies in its Lexicon and its publications of palm-leaf manuscripts. In view of the practical problems in using such a scheme in electronic communication/exchanges, alternative schemes based on plain ASCII characters (with or without upper case characters) are being used increasingly. At least five different popular schemes are in use. Here again the Special Committee of the Tamilnadu Govt. can play an important role in setting standards.
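The relationship between a plain-ASCII scheme that uses upper case letters and the classical diacritic scheme can be illustrated with a small character mapping. The sketch below is an assumption for illustration only: it covers just the long vowels, the mapping is a simplified subset rather than any one of the schemes in actual use, and the function name is hypothetical.

```python
# Illustrative subset only: upper case ASCII vowels stand for long vowels.
# Real schemes also cover retroflex and other consonants and differ in detail.
ASCII_TO_DIACRITIC = {
    "A": "\u0101",  # a with macron
    "I": "\u012b",  # i with macron
    "U": "\u016b",  # u with macron
    "E": "\u0113",  # e with macron
    "O": "\u014d",  # o with macron
}


def to_diacritics(text: str) -> str:
    """Replace mapped ASCII characters; leave everything else unchanged."""
    return "".join(ASCII_TO_DIACRITIC.get(ch, ch) for ch in text)
```

Note that a mapping of this kind treats every upper case vowel as a long vowel, so ordinary capitalized words would be altered as well. Ambiguities of this sort are one reason an agreed standard is needed.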
A dedicated Optical Character Recognition (OCR) package for Tamil would assist enormously in the rapid processing of scanned images (even those from microfilms) and the generation of Etexts. Currently we do not have any such OCR software for Tamil. The urgent need for such an OCR was recognized at the last TamilNet'97 conference held in Singapore but, to date, there has not been any systematic follow-up. In PM, we have a group of volunteers (all computer software professionals) who are working to develop one such OCR package and make it available free to the general public. With growing interest in Tamil computing, it is likely that commercial software will soon appear on the market. Voluntary efforts such as that of PM can benefit enormously if a polyvalent OCR is made available free.
Cooperation and coordination between all individual and institutional efforts engaged in Etext preparation are necessary to avoid duplication of work. A central/international registry of existing Etexts (public access or otherwise) can help the cause. A central/international registry of new Etext projects being undertaken (where anyone can find out if someone is already working on a given work) would also help coordination. During its first year, Project Madurai has found many committed Tamil lovers who are willing to get involved in the preparation of Etexts. Most of them are based overseas and do not have access to a printed copy of the work from which to start the text input process. A Chennai-based Book Bank, for example, could help the PM cause and speed up the process.
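What such a registry would need to record can be sketched as a small in-memory data structure. This is only an illustration of the idea, under assumptions: the class, field and status names are hypothetical, and matching titles by lower-cased text is a simplification that would not cope with the many spelling variants of transliterated Tamil titles.

```python
from dataclasses import dataclass


@dataclass
class RegistryEntry:
    title: str
    contact: str   # person or institution preparing the Etext
    status: str    # "in_preparation" or "released"


class EtextRegistry:
    """Minimal registry: one entry per work, keyed by normalized title."""

    def __init__(self):
        self._entries = {}

    @staticmethod
    def _key(title: str) -> str:
        return title.strip().lower()

    def lookup(self, title: str):
        """Return the existing entry for a work, or None if nobody has claimed it."""
        return self._entries.get(self._key(title))

    def claim(self, title: str, contact: str):
        """Register a new project; return (entry, created).

        If the work is already registered, the existing entry is returned
        with created=False so the caller can contact the earlier party.
        """
        existing = self.lookup(title)
        if existing is not None:
            return existing, False
        entry = RegistryEntry(title=title, contact=contact, status="in_preparation")
        self._entries[self._key(title)] = entry
        return entry, True

    def mark_released(self, title: str) -> None:
        """Record that the Etext of a registered work has been released."""
        entry = self.lookup(title)
        if entry is None:
            raise KeyError(title)
        entry.status = "released"
```

A volunteer about to start on a work would first call `claim`; a `created` value of `False` signals that someone else is already working on it.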
Project Madurai is a concrete example of how like-minded souls can work collectively for a common cause even if they are physically located in four different corners of the world. Email communication is a marvelous gift of the electronic age and we are using it successfully. For many of us, engagement in PM efforts has rekindled our interest in the language and its rich literature (and in the culture in general). Millions of Tamils living outside India already have access to the Internet, and many of them are thrilled by the fact that they can go to the Internet and download Etext files of literary classics. Particularly for those living in remote cities or countries where they cannot easily get printed copies of books even if they want to, Etext archives are wonderful gifts. In this respect, Etext projects such as PM do fulfill their stated goal of promoting the preservation of culture worldwide. In conclusion, I would like to record my deep appreciation and gratitude to all the PM volunteers for their enthusiasm and hard work.