CLE Urdu Digest Corpus

Abstract

The paper presents the design scheme and details of the first large publicly available corpus of the Urdu language. It covers the collection and cleaning techniques used for the first 100k-word derivative of the larger corpus, along with corpus design issues such as size and the genres and their ratios. The same design and techniques are being scaled to develop larger derivatives of the corpus with 500k, 1000k, and 5000k words. Because of its public license, the corpus will contribute significantly to the linguistic and computational analysis of Urdu.

Cite this article

Saba Urooj, Farah Adeeba, Sarmad Hussain, Farhat Jabeen, Rahila Parveen. (2012) CLE Urdu Digest Corpus, Conference on Language and Technology 2012.

Publisher

Center for Language Engineering

Country

Pakistan

City

Lahore

From

09-11-2012

To

10-11-2012