-- Abstract --

Latent Semantic Indexing Based on Factor Analysis

Noriaki Kawamae (NTT)



The main purpose of this paper is to propose a novel latent semantic indexing (LSI), statistical approach to simultaneously mapping documents and terms into a latent semantic space. This approach can index documents more effectively than the vector space model (VSM). Latent semantic indexing (LSI), which is based on singular value decomposition (SVD), and probabilistic latent semantic indexing (PLSI) have already been proposed to overcome problems in document indexing, but critical problems remain. In contrast to LSI and PLSI, our method uses a more meaningful, robust statistical model based on factor analysis and information theory. As a result, this model can solve the remaining critical problems in LSI and PLSI. Experimental results with a test collection showed that our method is superior to LSI and PLSI from the viewpoints of information retrieval and classification. We also propose a new term weighting method based on entropy.