Abstract
Text segmentation is the process of dividing a sentence into its constituent words. Word segmentation is an initial and obligatory step in Natural Language Processing. It has been studied for many languages, including English, Dutch, Chinese, Norwegian, and Swedish; this research focuses on Urdu. Unlike in English, words in Urdu are not always separated by spaces, and spaces are not used consistently, which gives rise to both space omission and space insertion errors. These two kinds of error are the major challenge for the segmentation task. This paper discusses the problems of Urdu word segmentation and suggests a solution to the space omission and space insertion problems. First, the clustered words are segmented, and then each clustered word is divided into valid words. A dictionary is used to mark word boundaries and to validate that each word is segmented correctly. The technique can be used in any application that processes Urdu text. It has been tested on 11,995 words collected from the Geo, Jang, and BBC news sites and other documents available online, with a result of around 97.2%.
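To make the idea concrete, the following is a minimal sketch of dictionary-based segmentation with dynamic programming, the technique named in the paper's title. It is not the authors' implementation: the function names `segment` and `segment_text`, the toy dictionaries, and the tie-breaking rule (prefer the split with the fewest words) are all assumptions made for illustration. Each whitespace-delimited cluster is split into dictionary words; a cluster that cannot be fully covered by the dictionary is left unchanged.

```python
from typing import List, Optional, Set


def segment(cluster: str, dictionary: Set[str]) -> Optional[List[str]]:
    """Split a space-free cluster into the fewest dictionary words.

    Returns None when no full segmentation exists.
    The fewest-words preference is an assumption of this sketch.
    """
    n = len(cluster)
    max_len = max((len(w) for w in dictionary), default=0)
    # best[i] = minimum number of words covering cluster[:i], None if impossible
    best: List[Optional[int]] = [None] * (n + 1)
    back = [0] * (n + 1)  # back[i] = start index of the last word ending at i
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] is None or cluster[start:end] not in dictionary:
                continue
            candidate = best[start] + 1
            if best[end] is None or candidate < best[end]:
                best[end] = candidate
                back[end] = start
    if best[n] is None:
        return None
    # Walk the back-pointers from the end to recover the words in order.
    words: List[str] = []
    pos = n
    while pos > 0:
        words.append(cluster[back[pos]:pos])
        pos = back[pos]
    return words[::-1]


def segment_text(text: str, dictionary: Set[str]) -> List[str]:
    """Segment every whitespace-delimited cluster; keep unsegmentable ones."""
    result: List[str] = []
    for cluster in text.split():
        words = segment(cluster, dictionary)
        result.extend(words if words is not None else [cluster])
    return result


# Hypothetical toy dictionary; a real system would load a full Urdu lexicon.
toy_words = ["پاکستان", "زندہ", "باد"]
print(segment("".join(toy_words), set(toy_words)))
```

Here the three toy words are concatenated without spaces to imitate a space omission error, and the sketch recovers them because each one is in the dictionary. The result of such an approach depends heavily on dictionary coverage and on how ties between competing splits are resolved.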

Rabiya Rashid, Seemab Latif (2012). A Dictionary Based Urdu Word Segmentation Using Dynamic Programming for Space Omission Problem. Conference on Language and Technology 2012.
Publisher: Center for Language Engineering
Country: Pakistan
City: Lahore
From: 09-11-2012
To: 10-11-2012