Abstract
Document clustering is an unsupervised approach in which a large collection of documents
(corpus) is subdivided into smaller, meaningful, identifiable, and verifiable sub-groups (clusters).
Representing documents meaningfully and implicitly identifying the patterns on which this
separation is performed are the challenging parts of document clustering. We propose a
document clustering technique that uses a graph-based document representation with constraints. A graph
data structure can easily capture non-linear relationships between nodes; a document contains various
feature terms that can be non-linearly connected, and hence a graph can easily represent this
information. Constraints are explicit conditions for document clustering in which background knowledge
is used to set the direction for Linking or Not-Linking a set of documents to a target cluster, thus
guiding the clustering process. We regard clustering as an ill-defined problem: there can be many
clustering results. Background knowledge can be used to drive the clustering algorithm in the right
direction. We propose three types of constraints: instance-level, corpus-level, and
cluster-level constraints. A new algorithm, Constrained HAC, is also proposed; it incorporates
instance-level constraints as prior knowledge, which guide the clustering process and lead to better
results. An extensive set of experiments has been performed on both synthetic and standard document
clustering datasets. The results are compared on standard clustering measures such as purity, entropy,
and F-measure. They show that the proposed approach improves cluster
quality.
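
To make the idea of instance-level constraints concrete, the following is a minimal sketch and not the authors' Constrained HAC algorithm, which the abstract does not specify. It assumes that HAC stands for hierarchical agglomerative clustering, that "Linking" and "Not-Linking" can be read as must-link and cannot-link pairs of documents, that clusters are merged by average linkage, and that pairwise document distances are already available as a matrix (the graph-based representation itself is not modeled). The function name `constrained_hac` and all parameter names are hypothetical.

```python
from itertools import combinations


def constrained_hac(dist, k, must_link=(), cannot_link=()):
    """Average-link agglomerative clustering with instance-level constraints.

    dist: symmetric matrix (list of lists) of pairwise document distances.
    k: target number of clusters.
    must_link / cannot_link: iterables of (i, j) document-index pairs.
    Returns a sorted list of clusters, each a sorted list of document indices.
    """
    clusters = [{i} for i in range(len(dist))]
    forbidden = {frozenset(pair) for pair in cannot_link}

    def find(doc):
        # Locate the cluster that currently holds a document.
        return next(c for c in clusters if doc in c)

    def merge(a, b):
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)

    def violates(a, b):
        # A merge is illegal if it would join any cannot-link pair.
        return any(frozenset((i, j)) in forbidden for i in a for j in b)

    def average_link(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    # Assumption: must-link pairs are merged up front, before any
    # distance-based merging takes place.
    for i, j in must_link:
        a, b = find(i), find(j)
        if a is not b:
            if violates(a, b):
                raise ValueError("must-link and cannot-link constraints conflict")
            merge(a, b)

    # Repeatedly merge the closest pair of clusters that breaks no constraint.
    while len(clusters) > k:
        allowed = [(a, b) for a, b in combinations(clusters, 2)
                   if not violates(a, b)]
        if not allowed:
            break  # the constraints prevent reaching k clusters
        a, b = min(allowed, key=lambda pair: average_link(*pair))
        merge(a, b)

    return sorted(sorted(c) for c in clusters)


# Toy example: four "documents" placed on a line at these positions.
positions = [0, 1, 5, 6]
dist = [[abs(p - q) for q in positions] for p in positions]

print(constrained_hac(dist, 2))
print(constrained_hac(dist, 2, must_link=[(1, 2)], cannot_link=[(0, 1)]))
```

In the toy example, the unconstrained run groups the two nearby pairs, while the constrained run is forced to keep documents 0 and 1 apart and documents 1 and 2 together, which is one way background knowledge could redirect the clustering.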
F. Amin, M. Raf, M. Shahid (2016). Document Clustering Using Graph Based Document Representation with Constraints. Pakistan Journal of Engineering and Applied Sciences, Volume 18, Issue 1.