CLE Urdu Digest Corpus

Abstract

This paper presents the design scheme and details of the first large publicly available corpus of the Urdu language. It covers the collection and cleaning techniques used for the first 100k-word derivative of the larger corpus, along with corpus design issues such as size and the choice of genres and their ratios. The same design and techniques are being scaled to develop larger derivatives of the corpus with 500k, 1000k and 5000k words. Because it is released under a public license, the corpus is expected to contribute significantly to linguistic and computational analysis of Urdu.

Citation

Saba Urooj, Farah Adeeba, Sarmad Hussain, Farhat Jabeen, Rahila Parveen (2012). CLE Urdu Digest Corpus. Conference on Language and Technology 2012.
