Cross-Language Text Filtering Based on Text Concepts and kNN
Weifeng Su, Shaozi Li, Tanqiu Li, Wenjian You
The WWW is increasingly being used as a source of information, and users access this volume of information with direct manipulation tools. Clearly, it would be useful to have a tool that keeps the texts we want and removes those we do not want from the large flow of information reaching us. This paper describes a module that sifts through the large number of texts retrieved by the user.
The module is based on HowNet, a knowledge dictionary developed by Mr. Zhendong Dong. In this dictionary, the concept of a word is decomposed into sememes. In the philosophy of HowNet, all concepts in the world can be expressed by combinations of more than 1500 sememes. The sememe is a very useful notion for solving the problem of synonymy, which is the most difficult problem in text filtering. We divided the set of sememes into two subsets: classifiable sememes and unclassifiable sememes. Classifiable sememes are those that are more useful in distinguishing a document's class from other documents. Unclassifiable sememes are those that appear similarly in all documents. The classifiable set contains about 800 sememes. We used these 800 classifiable sememes to build the Classifiable Sememes Vector Space (CSVS).
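The split into classifiable and unclassifiable sememes can be illustrated with a minimal sketch. The abstract does not state the selection criterion, so the rule used here (the spread of a sememe's relative frequency across document classes must reach a threshold) is an assumption, as are the function name `split_sememes`, the parameter `min_spread`, and the toy frequencies.

```python
def split_sememes(class_freqs, min_spread=0.1):
    """Partition sememes by how unevenly they occur across classes.

    class_freqs: {class_name: {sememe: relative frequency in that class}}
    min_spread:  hypothetical threshold, not taken from the paper.
    """
    sememes = set()
    for freqs in class_freqs.values():
        sememes |= set(freqs)

    classifiable, unclassifiable = set(), set()
    for s in sememes:
        values = [freqs.get(s, 0.0) for freqs in class_freqs.values()]
        # A sememe that appears about equally often everywhere
        # carries little information about the class of a document.
        if max(values) - min(values) >= min_spread:
            classifiable.add(s)
        else:
            unclassifiable.add(s)
    return classifiable, unclassifiable


# Toy frequencies, invented for illustration only.
toy_freqs = {
    "finance": {"money": 0.30, "time": 0.05},
    "sports": {"money": 0.02, "time": 0.06},
}
toy_classifiable, toy_unclassifiable = split_sememes(toy_freqs)
```

With these toy numbers, "money" varies strongly between the two classes while "time" does not, so only "money" would become a dimension of the vector space.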
A text is represented as a vector in the CSVS after the following steps:
1. Text preprocessing: identify the language of the text and apply the processing appropriate to that language.
2. Part-of-speech tagging.
3. Keyword extraction.
4. Keyword sense disambiguation based on context: for each keyword, we compute the relevance of its classifiable sememes to the classifiable sememes of the surrounding words. We increase the weight of a semantic item if it contains classifiable sememes that also occur in the semantic items of the surrounding words. This is not a strict disambiguation algorithm; it only adjusts the weights of those semantic items.
5. The keywords are reduced to sememes, and the weights of the classifiable sememes over all semantic items of all keywords are accumulated to give the weight of each vector feature.
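Steps 4 and 5 can be sketched as follows, starting from keywords that have already been extracted. This is a minimal illustration under stated assumptions, not the authors' implementation: `SEMEME_DICT` and `CLASSIFIABLE` are hypothetical toy stand-ins for HowNet entries and the classifiable sememe set, and the weighting rule (a base weight of 1 per semantic item, plus 1 for each classifiable sememe shared with the context, normalized per keyword) is assumed because the abstract does not give exact formulas.

```python
from collections import defaultdict

# Hypothetical toy data; real entries would come from HowNet.
CLASSIFIABLE = {"money", "institution", "water", "land"}
SEMEME_DICT = {
    "bank": [{"institution", "money"}, {"land", "water"}],
    "loan": [{"money", "give"}],
    "river": [{"water", "flow"}],
}


def sense_weights(word, context):
    """Step 4: weight each semantic item of `word` by context overlap."""
    context_sememes = set()
    for other in context:
        if other != word:
            for sense in SEMEME_DICT.get(other, []):
                context_sememes |= sense & CLASSIFIABLE
    # Base weight 1, plus 1 per classifiable sememe shared with the context.
    raw = [1.0 + len(sense & CLASSIFIABLE & context_sememes)
           for sense in SEMEME_DICT.get(word, [])]
    total = sum(raw)
    return [w / total for w in raw] if total else []


def text_to_vector(keywords):
    """Step 5: accumulate semantic-item weights onto classifiable sememes."""
    vector = defaultdict(float)
    for word in keywords:
        senses = SEMEME_DICT.get(word, [])
        for sense, weight in zip(senses, sense_weights(word, keywords)):
            for sememe in sense & CLASSIFIABLE:
                vector[sememe] += weight
    return dict(vector)


finance_vec = text_to_vector(["bank", "loan"])
```

In this toy case the financial sense of "bank" receives twice the weight of the riverside sense because it shares "money" with "loan". No sense is discarded, which matches the remark that the procedure adjusts weights rather than performing strict disambiguation.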
A user provides some texts to express what he is interested in. These texts are all expressed as vectors in the CSVS, and the vectors represent the user's preference. The relevance of two texts is measured by the cosine of the angle between their vectors. When a new text arrives, it is also expressed as a vector in the CSVS. We find its k nearest neighbours among the texts provided by the user and calculate the relevance of the new text to these k nearest neighbours. If the relevance is greater than a certain threshold, the text is of interest to the user; if it is smaller, the text is not. The value of k is determined by computing the neighbours of every training vector.
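The filtering decision can be sketched as below, with vectors stored as sparse dictionaries from sememe to weight. The abstract does not say how the relevance to the k neighbours is combined, so taking the mean cosine similarity is an assumption; the function names and the default values of `k` and `threshold` are likewise hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dicts)."""
    dot = sum(w * v.get(s, 0.0) for s, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_interesting(new_vec, user_vecs, k=3, threshold=0.5):
    """Accept a text if its mean similarity to its k nearest
    user-provided texts exceeds the threshold (assumed rule)."""
    sims = sorted((cosine(new_vec, u) for u in user_vecs), reverse=True)
    nearest = sims[:k]
    if not nearest:
        return False
    return sum(nearest) / len(nearest) > threshold


# Toy user profile, invented for illustration only.
user_vecs = [{"money": 1.0, "institution": 1.0}, {"money": 1.0}]
accepted = is_interesting({"money": 2.0}, user_vecs, k=2)
rejected = is_interesting({"water": 1.0}, user_vecs, k=2)
```

Because cosine similarity ignores vector length, a long and a short text with the same sememe proportions are treated as equally relevant.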
Information filtering based on classifiable sememes has several advantages:
1. Low-dimensional input space: we use 800 sememes instead of 10000 words.
2. Few irrelevant features after keyword extraction and the removal of unclassifiable sememes.
3. The feature weights of the document vectors are large.
We used documents from eight different users in our experiments. All of these users provided texts in both Chinese and English. Taking the users' feedback into account, we obtained about 88 percent recall and precision, which demonstrates that the method is successful.
Classifiable Sememe, Vector Space, kNN, Text Representation, HowNet