Abstract
While the World Wide Web is an attractive resource,
few researchers can access or manage a Web-scale
corpus. Instead, they use search-hit counts as a
substitute for direct measurements on a Web corpus. In contrast, one can download a small, high-quality
corpus such as Wikipedia and carry out exact
measurements. Through extensive experiments with multiple word-association measures and several public
datasets, we show that for exploring document-level
co-occurrence-based word associations, Wikipedia, despite being three orders of magnitude smaller, is
a reasonable alternative to a Web corpus that can only be accessed through search engines.
Further, with Wikipedia, one can carry out measurements at a granularity finer than document
scale. Instead of document-level co-occurrence,
one can treat a co-occurrence of a word pair as significant only if the two words occur within a certain
threshold distance of each other. In general, such
fine-grained information cannot be obtained from
search engines. Our experiments show that
word-level co-occurrence measures perform better
than document-level measures. This indicates
another practical advantage of Wikipedia, or
any other downloadable corpus, over a Web corpus
that can only be accessed through search engines.
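
The distinction between document-level and word-level co-occurrence can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the simple regex tokenizer, the toy documents, and the `max_distance` parameter are all assumptions made for illustration, and the paper's actual significance criterion and association measures may differ.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def document_level_count(docs, w1, w2):
    """Count documents in which both words appear anywhere."""
    count = 0
    for doc in docs:
        tokens = set(tokenize(doc))
        if w1 in tokens and w2 in tokens:
            count += 1
    return count


def word_level_count(docs, w1, w2, max_distance):
    """Count documents in which some occurrence of w1 lies within
    max_distance token positions of some occurrence of w2."""
    count = 0
    for doc in docs:
        tokens = tokenize(doc)
        pos1 = [i for i, t in enumerate(tokens) if t == w1]
        pos2 = [i for i, t in enumerate(tokens) if t == w2]
        if any(abs(i - j) <= max_distance for i in pos1 for j in pos2):
            count += 1
    return count


# Hypothetical toy corpus standing in for a downloaded corpus such as Wikipedia.
docs = [
    "The doctor spoke to the nurse before the operation.",
    "A doctor opened the clinic. Many years later, after the town had grown, a nurse joined.",
    "The nurse checked the chart.",
]

doc_hits = document_level_count(docs, "doctor", "nurse")
near_hits = word_level_count(docs, "doctor", "nurse", max_distance=5)
print(doc_hits, near_hits)
```

In this toy corpus, two documents contain both words, but only one contains them within five tokens of each other. A search engine's hit count corresponds roughly to the first quantity; the second requires direct access to the text, which is what a downloadable corpus provides.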
Om P. Damani, Pankhil Chedda, Dipak Chaudhari (2012). Wikipedia is a Practical Alternative to the Web for Measuring Co-occurrence based Word Association. Conference on Language and Technology 2012.