Abstract
The paper presents the design scheme and details of the first large publicly available corpus of the Urdu language. It covers the collection and cleaning techniques used for the first 100k-word derivative of the larger corpus, along with corpus design issues such as size, the genres included, and their ratios. The same design and techniques are being scaled to develop larger derivatives of the corpus with 500k, 1000k and 5000k words. Because it is released under a public license, the corpus will contribute significantly to the linguistic and computational analysis of Urdu.

Saba Urooj, Farah Adeeba, Sarmad Hussain, Farhat Jabeen, Rahila Parveen (2012). CLE Urdu Digest Corpus. Conference on Language and Technology 2012.
Publisher
Center for Language Engineering
Country
Pakistan
City
Lahore
From
09-11-2012
To
10-11-2012