-- Abstract --

The classification of text documents using context information

Noriaki Kawamae (Center for Advanced Research and Technology, The University of Tokyo)

keywords: knowledge assimilation, discovery and learning



This paper proposes a technique for classifying text documents into a hierarchical category structure by using the context information of the documents. Existing techniques treat a document to be classified not as a set of contexts but as a set of the single words that appear in it. Using single words as the indexing unit can introduce a problem. Take "dragonfly" as an example: the single words "dragon" and "fly" that it contains cannot convey the meaning of "dragonfly". The proposed technique instead classifies text documents using their context. We assume that the context information of a document is reflected in the combinations of words within each sentence and in the frequency with which those combinations appear. This paper shows that the technique weights contexts according to their appearance frequency, classifies documents using a suffix tree, and, through clustering, assigns them to a layered category structure. Among the parts of speech, we use only nouns, because nouns serve as the classification criterion and identify a class. A suffix tree is a compact trie containing all the suffixes of a set of strings, which makes it possible to find strings quickly. Its edges are labeled with non-empty strings, and edges leaving the same node are labeled with strings that start with different words. We adapt the suffix tree so that it handles context information rather than character strings: a character of a string corresponds to a word, and the order of characters corresponds to the order of words. Documents contain many contexts, and a suffix tree requires a fixed order of words, so we determine the word order from mutual information and entropy. If the mutual information of two words is 0, the words are independent and their appearances in documents are unrelated. If the entropy of the words is close to 1, the words appear across documents with nearly equal probability.
Otherwise, the words appear in only a few documents. Using mutual information and entropy, we build the suffix tree of the documents. The resulting suffix tree classifies documents by their combinations of words. The words labeling its edges are used to construct a hierarchical category structure, and each leaf of the tree is labeled with a document ID. Second, we build a hierarchical category structure from the suffix tree. For this purpose, we define the similarity of words by stochastic complexity (SC). The value of SC is directly proportional to the number of documents that share the words being compared. The lower the value of SC, the higher the relevance between the compared words. We use SC to build the hierarchical category structure from the suffix tree. The proposed technique can generate a hierarchical category structure suited to classifying text documents that cover two or more topics. Moreover, the hierarchical category structure is variable, because it changes not only with the people using it but also with the documents being classified. The resulting hierarchical category structure expresses the documents concisely through words together with their mutual information and entropy.
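
The two word statistics and the word-level suffix tree described above can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in the abstract: each document is already reduced to a list of nouns, entropy is normalized by the logarithm of the number of documents so that it lies between 0 and 1, mutual information is computed over the presence or absence of two words in a document, and the tree is an uncompressed word-level suffix trie rather than the compact tree. The names `normalized_entropy`, `mutual_information`, `build_word_suffix_trie`, and `find_documents` are hypothetical. How the two statistics fix the word order, and the SC-based clustering, are not reproduced here.

```python
import math


def normalized_entropy(word, docs):
    """Entropy of a word's occurrences over documents, scaled to [0, 1] (assumed normalization)."""
    counts = [doc.count(word) for doc in docs]
    total = sum(counts)
    if total == 0 or len(docs) < 2:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return h / math.log(len(docs))


def mutual_information(w1, w2, docs):
    """Mutual information (bits) between the presence of w1 and of w2 in a document."""
    n = len(docs)
    mi = 0.0
    for a in (True, False):
        for b in (True, False):
            p_ab = sum(1 for d in docs if (w1 in d) == a and (w2 in d) == b) / n
            p_a = sum(1 for d in docs if (w1 in d) == a) / n
            p_b = sum(1 for d in docs if (w2 in d) == b) / n
            if p_ab > 0:
                mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


class Node:
    def __init__(self):
        self.children = {}    # word -> child Node (one word plays the role of one character)
        self.doc_ids = set()  # IDs of documents containing the word sequence ending here


def build_word_suffix_trie(sequences):
    """Insert every suffix of every word sequence; sequences maps doc ID -> ordered word list."""
    root = Node()
    for doc_id, words in sequences.items():
        for start in range(len(words)):
            node = root
            for word in words[start:]:
                node = node.children.setdefault(word, Node())
                node.doc_ids.add(doc_id)
    return root


def find_documents(root, phrase):
    """Return the IDs of documents that contain the given contiguous word sequence."""
    node = root
    for word in phrase:
        node = node.children.get(word)
        if node is None:
            return set()
    return set(node.doc_ids)
```

In this sketch, a word that occurs once in every document has normalized entropy 1, a word confined to a single document has entropy 0, and two words whose appearances are statistically independent have mutual information 0, which matches the interpretation given in the abstract. Calling `find_documents` with a sequence of words returns the documents that share that combination of words.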