Natural Language Processing (NLP)

Introduction to Language Technology at GIST

C-DAC GIST has always been at the forefront of the development of new tools and technologies. A leader in the area, the GIST Labs have built their expertise in technologies as varied as Natural Language Processing (NLP), Video, Embedded Systems and Word-processing, to name only a few. This tradition of cutting-edge technology is continually upheld at the GIST Labs, where new tools suited to the needs of today's fast-developing digital world are being developed.

Some of the major technologies, which underlie the development of new tools and applications, are showcased below. The areas are varied and have been classified based on their foci of interest.

Natural Language Processing Technologies
The new Web is based on Natural Language Processing, which aims to bring humans and the digital world closer. Doing away with statistical tools that at best could emulate Human Machine Interface in a narrow manner, Natural Language Processing (NLP) is the new area where the major developments of W3C will be undertaken. To ensure that Indian Languages are on this new platform, exciting and new technologies are being developed.

C-DAC GIST MT (Machine Translation) Evaluation Tool
"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains." This is a definition from the European Association for Machine Translation (EAMT), "an organization that serves the growing community of people interested in MT and translation tools, including users, developers, and researchers of this increasingly viable technology."

GIST Visual Thesaurus
GIST Visual Thesaurus presents a thesaurus-based word map for the entered word in an interactive and easily explorable manner. Its graphical visualization of the word map and word net makes it an easy tool to use and encourages learning. Words can be entered in Unicode for Indian languages; currently Hindi and Gujarati are supported. It is aimed at anyone looking for the right word to express an idea, as well as at those who want to explore and learn a language.

Spell-Checkers
GIST has to its credit the development of the first Indian-language spell-checkers under both DOS and Windows. The next generation of spell-checkers uses a new, dynamic algorithm that permits a faster and more efficient spell-check. The existing dictionaries have been upgraded: the new dictionaries contain more words and have been updated to suit present-day requirements, where spell-checkers are needed for the Web.

Spell-checkers are available as a stand-alone utility and accept data in 8-bit ISCII/PASCII as well as Unicode (Big-Endian and Little-Endian) and UTF-8.
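
As an illustration of what accepting several input encodings involves, the sketch below checks the start of a byte buffer for a Unicode byte-order mark (BOM). It is a minimal sketch and not the GIST implementation: the function name sniff_unicode_encoding is hypothetical, and input without a BOM (8-bit ISCII/PASCII, or UTF-8 saved without a BOM) is simply reported as undetermined.

```python
import codecs

# Byte-order marks for the Unicode encodings named in the text.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_unicode_encoding(data: bytes):
    """Return (encoding, payload) if data starts with a known BOM.

    Returns (None, data) when no BOM is present; the caller then has to
    decide by other means (hypothetical: a user setting or a heuristic).
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data
```

A caller would decode the returned payload with the reported encoding and fall back to an 8-bit code page only when no BOM is found.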

Grammar Checkers
Grammar checkers are a must in India. They can be used not only to detect incorrect grammar within a text but also, more importantly, to let the user ensure that the correct grammatical forms have been used. The tool can also be used by school children to master the intricacies of Indian language grammar.

The checker handles the following cases:

1. Concord at the N.P. Level

Variable Adjectives of Quality + Noun
Possessive + Noun
Relative Quantifier + Noun
Reflexive + Noun
Reciprocal + Noun
Interrogative + Noun
Indefinite + Noun
Comparative + Noun
Relative + Noun
Ordinal + Noun
Collective + Noun
Fractions + Noun
Multiplicative + Noun

2. Concord at the V.P. level

The verb group admits two types of checks:
1. An intra-verb checker, which handles relations within the verbal string and is indifferent to the nature of the relations between the VP and the NP.
2. A VP-NP checker, which applies concord rules between the verb and the noun. It handles all relational issues between the verb and the NP it governs.

3. Concord between NP (Subject) and V.P.
The checker validates Number, Person, Case and Gender rules between the Subject and the Verb.

4. Stylistic features which try to trap the most common errors committed by the native user

5. Fragments and Run-ons
A statistical analysis of readability, in terms of the Flesch-Kincaid index as well as other statistical measures, is also provided.
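
For reference, the Flesch-Kincaid grade level is computed from three counts, as in the sketch below. The coefficients are the published ones, which were calibrated on English text; whether the GIST checker uses the same values for Hindi is not stated here, so this should be read as an illustration of the index only. The function name is hypothetical.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from raw counts (English coefficients)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
```

Longer sentences and longer words both push the score up, which is why the index is a quick proxy for how hard a text is to read.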

A prototype of a first-ever Grammar-checker for Hindi has been developed. The design of the checker allows for easy adaptation to other languages.
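
The subject-verb concord check listed as case 3 above can be pictured as a comparison of grammatical features. The sketch below is an assumption about the general shape of such a check, not the prototype's actual design: the Features class and the concord_errors function are hypothetical, and Case is left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Features:
    """Agreement features carried by a subject NP or a verb form."""
    number: str  # "sg" or "pl"
    person: int  # 1, 2 or 3
    gender: str  # "m" or "f"


def concord_errors(subject: Features, verb: Features) -> list:
    """Return the names of the features on which subject and verb disagree."""
    return [
        name
        for name in ("number", "person", "gender")
        if getattr(subject, name) != getattr(verb, name)
    ]
```

For a masculine singular subject combined with a feminine singular verb form, the function reports a gender mismatch only. Because the check works on features rather than on word forms, the same code could serve any language whose analyser supplies those features.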

The Grammar checker accepts data in 8-bit ISCII/PASCII as well as Unicode (Big-Endian and Little-Endian) and UTF-8.

Online Thesauri
Thesauri, which provide much more semantic information than dictionaries, are a vital tool for search engines, data mining and information retrieval. GIST has started work on a Thesaurus Building Engine, which will ensure that the structure of the thesaurus, with its hyponyms and hypernyms, is correctly indexed, permitting fast information retrieval.
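
To make the idea of indexing hyponyms and hypernyms concrete, the sketch below stores each relation in both directions so that either can be looked up directly. It is a minimal illustration under assumed names (ThesaurusIndex, add, hyponyms, ancestors) and says nothing about how the Thesaurus Building Engine itself is built.

```python
from collections import defaultdict


class ThesaurusIndex:
    """Two-way index over narrower-term / broader-term relations."""

    def __init__(self):
        self._hypernyms = defaultdict(set)  # term -> broader terms
        self._hyponyms = defaultdict(set)   # term -> narrower terms

    def add(self, narrower: str, broader: str) -> None:
        """Record that `narrower` is a hyponym of `broader`."""
        self._hypernyms[narrower].add(broader)
        self._hyponyms[broader].add(narrower)

    def hyponyms(self, term: str) -> set:
        """Direct narrower terms of `term`."""
        return set(self._hyponyms.get(term, ()))

    def ancestors(self, term: str) -> set:
        """All broader terms reachable from `term`, at any depth."""
        seen = set()
        stack = [term]
        while stack:
            for parent in self._hypernyms.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen
```

A search engine could use ancestors() to broaden a query and hyponyms() to narrow it.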

Transliteration Utilities
In a country like India, where languages use scripts belonging to the Latin (Konkani), Perso-Arabic (Sindhi, Kashmiri, Urdu) and Brahmi (a majority of Indo-Aryan and all Dravidian scripts) families, transfer of content from one script base to another, especially of names, is a requirement for E-Governance, the Election Commission, etc.

The first attempt to convert names typed in English to Brahmi-based scripts was NTRANS, in which a dictionary supported by strong heuristics allowed for transliteration of names from English. UTRANS permitted conversion of Hindi and Punjabi to Urdu.


Browser toolbars transform English applications and websites into Indian languages at the click of a button.

Using a genetic algorithm and statistical tools, a set of Transliteration Utilities is under development, especially to bridge the gap between the Latin, Brahmi and Perso-Arabic platforms, as shown in the table below:

 

Name Conversion Utilities
These utilities allow for conversion of names from one language platform to another.

Text Transliteration
These converters allow the user to transliterate a text typed in Urdu to Hindi, or to see a text typed in Roman as Urdu.

Online Dictionaries
Dictionaries are a valuable database in a country like India, where Cross-Lingual Information Querying systems are urgently needed. They are also needed in areas such as E-Governance, Teaching Systems and Search Engines. GIST has started work on developing dictionaries in collaboration with the Language Boards and Academies of the particular linguistic region. The dictionary database can be mono-lingual or bi-lingual, or it can be a dictionary of synonyms, antonyms or idiomatic expressions common to the language.

Since dictionaries are often made by hand using traditional indexes, a dictionary validation and building tool has been created to ensure that the dictionaries are properly indexed and that the maximum information within the dictionary is retrievable.

Homophone Engine
The Homophone Engine is a sophisticated tool which searches for look-alikes in Indian languages as well as in Indian English. The problems treated here are mainly pertinent to Indian names as written both in English and in Indian scripts. However, they could also be extended to all alphabets, and some examples show lacunae in script systems other than Indian.

Homophone Engine - Problem Statement
A few of the major lacunae in existing English based solutions are listed below:

a. Letter to Sound Relationship
Existing algorithms work with only the 26 basic English letters. Extended character sets are not supported, hence names with unusual letters (like é) may not be retrieved correctly. Thus the name Barve will yield Barwe but not Barwé and Barvé.
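
One common way to keep accented Latin letters from blocking a match is to fold them to their base letters before any key is generated. The sketch below does this with Unicode decomposition; it is an illustration, not the Homophone Engine's method, and the name fold_accents is hypothetical. It is meant for Latin-script input only, since stripping combining marks from Indian scripts would remove signs such as the nukta and the halant.

```python
import unicodedata


def fold_accents(name: str) -> str:
    """Lower-case a Latin-script name and drop its diacritics."""
    # NFD splits "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()
```

After folding, Barvé and Barve produce the same string. The v/w alternation between Barve and Barwe is a separate, phonetic problem that folding does not address.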

b. First Character
Algorithms based on English depend on the first letter of the "tokenized word" to generate the key. Someone looking for Firoze or Fali will not get Phiroze or Phali, not to mention names spelled under the influence of numerology, such as KKarishma. There would be a lot of false negatives in these cases.
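
The first-letter dependence is easy to see in classic American Soundex, where the key is the initial letter followed by three digits. The sketch below is a plain implementation of that well-known algorithm, given only to reproduce the Firoze/Phiroze mismatch; it is not part of the C-DAC solution.

```python
# Letter groups of classic American Soundex.
_CODES = {}
for _letters, _digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex key, or "" if name has no A-Z letters."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = code
    return (first + "".join(digits) + "000")[:4]
```

Firoze encodes as F620 and Phiroze as P620: the digits agree, but the keys differ in the first character, so a lookup on one key never reaches the other name.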

c. Typos
Typos and noise are a fact of system data input. If the operator typed "Katrik" instead of "Kartik", the key-based approach will not be able to fetch the "Kartik" that we are looking for.
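
A swapped pair of letters such as Katrik for Kartik is usually caught with an edit distance that counts an adjacent transposition as a single error. The sketch below implements the optimal string alignment variant of Damerau-Levenshtein distance. It is shown as a generic technique for tolerating typos; nothing here indicates that the Homophone Engine uses it.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insert, delete, substitute and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]
```

With this measure "katrik" is one step away from "kartik", so a search that accepts candidates within a small distance would still retrieve the intended name.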

d. Name Variants
Existing English-based systems cannot handle the multiple ways in which a name can be spelled. Thus Chaudhary is spelled in around 34 different ways; Soundex can trap at best around 14-15 of them and fails on the rest.

e. Homophonic names which are not homographs
Soundex and NYSIIS/Metaphone fail for names that use silent letters and silent sounds. Some examples would be:

f. False Correct Results
Search on the Soundex code for "Sunil" and over 100 other names will show up. All Soundex-derived algorithms end up with these precision problems.

g. Name Sequence Variation
The British "First Name", "Middle Initial", "Last Name" style is not followed in the entire world. Name sequence variation is a cultural phenomenon and is widespread in India. Some cultures put the last name first and the first name last. Others keep only the geographical name as their name, and the "First name" is stored as an initial.
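
A simple way to make a comparison insensitive to name order is to compare the set of name parts rather than the string as typed. The sketch below builds such an order-independent key. It is an illustrative assumption rather than the engine's approach, the name name_key is hypothetical, and it does not attempt to match an initial against the full name it stands for.

```python
def name_key(full_name: str) -> tuple:
    """Return the name's tokens, lower-cased and sorted, as a comparison key."""
    # Treat commas and full stops as separators, e.g. "Sharma, Ramesh K."
    cleaned = full_name.replace(",", " ").replace(".", " ")
    return tuple(sorted(token.casefold() for token in cleaned.split()))
```

"Ramesh Kumar Sharma" and "Sharma, Ramesh Kumar" then yield the same key, whichever part the source database stored first.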

h. Multi-Cultural Diverse Name Databases
A name spelled one way in one state is spelled and pronounced very differently in the neighbouring state. These problems also exist between different cultures living in the same state. The problem is compounded by a system user or operator who already knows a third spelling of the name. Thus, whereas Oriya and a majority of Dravidian languages show the absence of the implicit vowel by a Halanta sign, Hindi and Gujarati do not use this notation but leave the final consonant with an implicit "a" which is not pronounced.

i. Abbreviated Name Variants
The Soundex codes for "Bandopadhyaya" and "Banerjee" are not the same. Existing English algorithms fail to retrieve these equivalent names. Similarly, commonly used nicknames such as Vainu for Vainateya will not be mapped under a Soundex search. For example, the name Mohammad can be abbreviated as Md., Mmd., Mhd. or Mohd. There are numerous such examples of abbreviations.
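
Abbreviations of this kind cannot be derived phonetically, so they are usually handled with a lookup table applied before matching. The sketch below uses only the four abbreviations of Mohammad given above; the table and the function name are hypothetical, and a real table would need to be curated per region and checked for ambiguous entries.

```python
# Abbreviation table built from the examples in the text (illustrative only).
ABBREVIATIONS = {
    "md": "mohammad",
    "mmd": "mohammad",
    "mhd": "mohammad",
    "mohd": "mohammad",
}


def expand_abbreviations(name: str) -> str:
    """Lower-case each token and replace known abbreviations by the full form."""
    expanded = []
    for token in name.split():
        key = token.rstrip(".").casefold()
        expanded.append(ABBREVIATIONS.get(key, key))
    return " ".join(expanded)
```

After expansion, "Mohd. Rafi", "Md Rafi" and "Mohammad Rafi" all compare equal.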

j. Titles and Qualifiers
Titles and qualifiers such as Dr. or Prof. may occur at a much higher frequency; in such scenarios the key-based approach is overwhelmed.

k. Hyphenated name
A Soundex based algorithmic search for hyphenated names will not yield exact results:
Thus Abd-al-Razzaq ~ Abdul Razzaq ~ Abd-ur-Razzq will not be displayed in Soundex as variants of the same name.

Homophone Engine - Solution
The solution developed by C-DAC attacks the problem not only from a homophonic approach but also from a Context Bound Name Grammar approach. Contextual rules added to the homophonic rules ensure that the result is neither over-generative nor under-generative but provides the best possible fit. This ensures that Sunil does not map to the possibilities listed above but maps to Suneel, Soonil, Sooneel, Sunneil and Suneil. Only exact and correct homophones/homographs, including abbreviations and name variants, are provided.

Examples are given below to showcase the application, which is at present in a beta stage of testing. We have three options in place; results are given below for two words: Chaudhary and Ebrahim.

Variants returned for "chaudhary":

chaudhaary, coudhary, chaudhary, chaoudhari, chaaudhary, chaudhaari,
choudhry, chaodhri, chaodhary, chaaudhari, chaudhri, choudhri,
choudhary, chaodhari, chudhari, chodhri, chaudhhary, choudhari,
chodhry, chowdhry, chaudhari, chodhari, chaudahry, choudhray,
chodhary, chowdhary, chudhri, chaudahri, chaudahary, choaudhary,
coudhari, chuadhari, chaudhry, choudharay, chauudhari, chovdhari,
chudhary, chaoudhary, chowdhari, chowadhari, chudhry, chaudahari,
choaudhari, chaowdhari, choudhaary, chauadhari, chovdhary, chowdhri

Variants returned for "ebrahim":

ibrahim, ebrahim, ibrrahim, ibrahahim, ebraheem, ibraheem,
ibrahaim, ibrhaim, ebarahim, ibraahim, ibarahim, ibarhim,
eabrahim, ibbrahim, iabrahim, ibrhahim, ebrhim, ibrahhim, ibrhim

The Homophone Engine can be deployed in a large number of applications, including spell-checkers, name translation utilities, data-mining applications (such as Election Commission or telephone directory search) and IT databases where homographs need to be detected.

Lemmatisers
Lemmatisers are a must for higher-level Natural Language Processing (NLP), especially if the word has to be correctly tagged as to its categorical class. Lemmatisers have a wide range of applications in areas as diverse as Translation, Semantic Web, Data Mining, Natural Query Systems to name only a few.

In addition, when the Lemmatiser is coupled with the spell-checker, a typo is corrected and the lemmatised form is appended.
The Lemmatiser is available as a stand-alone utility and accepts data in 8-bit ISCII/PASCII as well as Unicode (Big-Endian and Little-Endian) and UTF-8.

Microconverters
Conversion of data to storage and vice versa has been a requirement for the complex scripts of India. Even with the advent of Unicode, the need for converters remains strong, more so in the area of embedded tools and technologies, where memory and speed are at a premium. GIST has been at the forefront of this area and has to its credit the creation of ISFOC.

Ongoing research on script grammars has resulted in converters which are bi-directional and can move from storage to display and back with a single DLL. A single generic engine handles all the converters, which are extremely small in size. This allows for easy and fast conversion, especially on embedded devices, which are memory constrained.

Converters for third-party fonts are also available. The converter rule file can be written by the font developer and is extremely easy to build thanks to a user-friendly GUI, which guides the user through writing the converter.

For more details, please contact:

More information on GIST products
E-Mail:
info.gist@cdac.in

Sales related information
E-Mail:
sales.gist@cdac.in

Support related information
E-Mail:
support.gist@cdac.in