A Dictionary Based Urdu Word Segmentation Using Dynamic Programming for Space Omission Problem

Abstract

Text segmentation is the process of dividing a sentence into its constituent words. In Natural Language Processing, word segmentation is an initial and obligatory step. Word segmentation has been studied for many languages, including English, Dutch, Chinese, Norwegian and Swedish; this research focuses on Urdu. Unlike in English, words in Urdu are not always separated by spaces, and spaces are not used consistently, which gives rise to both space omission and space insertion errors. These two kinds of error are the major challenge for the segmentation task. This paper discusses the problems of Urdu word segmentation and suggests a solution to the space omission and space insertion problems. First, the clustered words are segmented, and then each clustered word is divided into valid words. We use a dictionary for marking word boundaries and for validating that each word is segmented correctly. This technique can be used for any application that processes Urdu text. The work has been tested on words collected from the Geo, Jang and BBC news sites and from other documents available online. The proposed solution was tested on 11,995 words, and the result is around 97.2%.
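
To make the idea of dictionary-based segmentation with dynamic programming concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say how competing segmentations are ranked, so this sketch assumes a "fewest dictionary words" criterion. The function name `segment` and the toy Latin-script lexicon are hypothetical and used only for illustration; the same code works on Urdu strings because it operates on Unicode characters.

```python
def segment(text, lexicon):
    """Split a space-free string into the fewest lexicon words.

    Returns a list of words, or None if no full segmentation exists.
    Assumption: "fewest words" is the ranking criterion (not stated in the paper).
    """
    n = len(text)
    max_len = max((len(w) for w in lexicon), default=0)

    # best[i]: minimum number of words that exactly cover text[:i] (None = unreachable)
    # back[i]: start index of the last word in that best cover
    best = [None] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0

    for end in range(1, n + 1):
        # Only look back as far as the longest dictionary word.
        for start in range(max(0, end - max_len), end):
            if best[start] is None or text[start:end] not in lexicon:
                continue
            candidate = best[start] + 1
            if best[end] is None or candidate < best[end]:
                best[end] = candidate
                back[end] = start

    if best[n] is None:
        return None

    # Follow the back-pointers from the end to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


toy_lexicon = {"urdu", "word", "segmentation", "seg"}
print(segment("urduwordsegmentation", toy_lexicon))
```

The dictionary plays both roles mentioned in the abstract: it proposes candidate boundaries (any substring found in the lexicon) and it validates the result (a segmentation is accepted only if every piece is a dictionary word). A string containing an out-of-dictionary word returns `None` in this sketch; a real system would need a fallback for such cases.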

Cite this article

Rabiya Rashid, Seemab Latif. (2012) A Dictionary Based Urdu Word Segmentation Using Dynamic Programming for Space Omission Problem, Conference on Language and Technology 2012.

Publisher

Center for Language Engineering

Country

Pakistan

City

Lahore

From

09-11-2012

To

10-11-2012