Wikipedia is a Practical Alternative to the Web for measuring Co-occurrence based Word Association

Abstract

While the World Wide Web is an attractive resource, few researchers can access or manage a Web-scale corpus. Instead, they use search-engine hit counts as a substitute for direct measurements on a Web corpus. In contrast, one can download a small, high-quality corpus such as Wikipedia and carry out exact measurements. Through extensive experiments with multiple word-association measures and several public datasets, we show that, for exploring document-level co-occurrence-based word associations, Wikipedia is a reasonable alternative to a Web corpus that can only be accessed through search engines, despite being three orders of magnitude smaller. Further, with Wikipedia, one can carry out measurements at a granularity finer than the document scale: instead of counting document-level co-occurrence, one can treat a word-pair occurrence as significant only if the two words occur within a certain threshold distance of each other. In general, such fine-grained information cannot be obtained from search engines. Our experiments show that the word-level co-occurrence measures perform better than the document-level measures. This indicates another practical advantage of Wikipedia, or any other downloadable corpus, over a Web corpus that can only be accessed through search engines.
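
The distinction between document-level and word-level (distance-thresholded) co-occurrence can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the toy corpus, the function names, the threshold value, and the choice of pointwise mutual information (PMI) as a stand-in for the association measures evaluated in the paper.

```python
import math

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    "the bank raised interest rates on loans".split(),
    "we sat on the river bank while far away someone discussed money".split(),
    "money in the bank earns interest".split(),
    "the river flooded".split(),
]

def doc_frequency(docs, word):
    """Number of documents that contain the word at least once."""
    return sum(1 for tokens in docs if word in tokens)

def doc_cooccurrence(docs, w1, w2):
    """Document-level count: documents containing both words anywhere."""
    return sum(1 for tokens in docs if w1 in tokens and w2 in tokens)

def window_cooccurrence(docs, w1, w2, max_dist):
    """Word-level count: documents in which some occurrence of w1 lies
    within max_dist token positions of some occurrence of w2."""
    count = 0
    for tokens in docs:
        pos1 = [i for i, t in enumerate(tokens) if t == w1]
        pos2 = [i for i, t in enumerate(tokens) if t == w2]
        if any(abs(i - j) <= max_dist for i in pos1 for j in pos2):
            count += 1
    return count

def pmi(joint, f1, f2, n):
    """PMI from document counts (an assumed example measure)."""
    if joint == 0 or f1 == 0 or f2 == 0:
        return float("-inf")
    return math.log2(joint * n / (f1 * f2))

n = len(docs)
f_bank = doc_frequency(docs, "bank")
f_money = doc_frequency(docs, "money")
doc_joint = doc_cooccurrence(docs, "bank", "money")
win_joint = window_cooccurrence(docs, "bank", "money", max_dist=3)

print("document-level PMI:", pmi(doc_joint, f_bank, f_money, n))
print("word-level PMI:    ", pmi(win_joint, f_bank, f_money, n))
```

In this toy corpus, the second document mentions both "bank" and "money" but far apart, so it contributes to the document-level count and not to the thresholded one. A search engine exposes only the former kind of count, whereas a downloaded corpus permits both.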

Cite this article

Om P. Damani, Pankhil Chedda, Dipak Chaudhari. (2012) Wikipedia is a Practical Alternative to the Web for measuring Co-occurrence based Word Association, Conference on Language and Technology 2012.

Publisher

Center for Language Engineering

Country

Pakistan

City

Lahore

From

09-11-2012 to 10-11-2012