Computational linguistic research for Indic languages
There are many languages in the Indian subcontinent with speaker populations ranging from tens of millions to hundreds of millions. For most of these languages, little computational linguistic work has been done. Without computational tools, even widely spoken languages risk becoming effectively “defunct” in this electronic age of the Web, Internet and smart devices.
The REU student should have some interest in the languages of the region, or at least in languages other than English. We have done initial work with one such language. Here are links to some of our papers: an ACM Transactions on Asian Language Information Processing 2008 paper on morphology learning, an ACL 2009 paper on POS tagging, and an ACL 2002 workshop paper on unsupervised morphology learning.
There are various ways the current work can be extended. Some ideas are given below. Note that we can test our initial ideas with English before we implement them for these languages.
- We are developing corpora (collections of documents) for these languages. We are also working on essential computational linguistic tasks such as POS tagging, developing dependency tags, and building dependency treebanks. There is research to be done and there are tools to be developed in each of these areas. Here is a proposal we wrote recently to carry out some of the corpus building work. You can choose to work on any project listed there.
- Developing a framework for a computational dictionary that can accommodate one or more such languages. The dictionary will be built from the corpora of documents.
- It is very difficult for most speakers of these languages to enter data with the regular Roman (QWERTY) keyboard, since the number of characters that need to be entered is large due to the presence of many diacritic marks and conjoints. An REU student, Miguel Gonzalez, developed a soft-keyboard for an Indic language in the Fall of 2009. Here is a paper we submitted to a conference; it was not accepted. We have written a proposal to NSF to extend the work reported in this preliminary paper. You can work on any one of the projects in this proposal. Don't be scared that it's in the context of an Indic language: you can try things out with English first, and then treat the Indic language as a second language on which to test your ideas.
- With the availability of multi-touch screen computers, phones and PDAs, on which one can design context-based soft-keyboards, we believe that the data-entry problem for these languages is becoming easier to address. We are interested in designing smart soft-keyboards with predictive text entry for such languages. We will use machine learning techniques to develop these soft-keyboards.
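
To give a feel for what predictive text entry could involve, here is a minimal sketch in Python, tried out with English first as suggested above. It is only an illustration, not the method used in our papers or proposals: the function names `train_char_model` and `predict_next`, the toy word list, and the choice of a simple character n-gram count model are all assumptions made for this example. A context-based soft-keyboard might use such predictions to highlight or enlarge the keys most likely to be pressed next.

```python
from collections import Counter, defaultdict


def train_char_model(words, order=2):
    """Count which character follows each context of `order` characters.

    Hypothetical helper for illustration; '^' pads the start of each word.
    """
    counts = defaultdict(Counter)
    for word in words:
        padded = "^" * order + word
        for i in range(order, len(padded)):
            context = padded[i - order:i]
            counts[context][padded[i]] += 1
    return counts


def predict_next(counts, prefix, order=2, k=3):
    """Return up to k most frequent next characters after the typed prefix."""
    context = ("^" * order + prefix)[-order:]
    followers = counts.get(context, Counter())
    # Sort by descending count, breaking ties alphabetically for stable output.
    ranked = sorted(followers.items(), key=lambda kv: (-kv[1], kv[0]))
    return [ch for ch, _ in ranked[:k]]


# Toy word list standing in for a real corpus (assumption for this sketch).
words = ["the", "then", "there", "this", "that", "tree"]
model = train_char_model(words)
print(predict_next(model, "th"))  # ['e', 'a', 'i']
```

Python strings are sequences of Unicode code points, so for an Indic script this sketch would treat each diacritic mark as a separate symbol. Whether code points, conjoints, or some other unit is the right thing to predict is one of the design questions a project could explore.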