[Abstract]

Change with Delayed Labeling: when is it detectable?

Shusaku Tsumoto and Shoji Hirano
Department of Medical Informatics, Shimane University, School of Medicine, Japan



One of the most important problems in rule induction is how to estimate which method is best suited to an applied domain. A method that is useful in some domains may not be useful in others, so choosing among these methods is very difficult. To address this problem, we introduce multiple testing based on recursive iteration of resampling methods for rule induction (MULT-RECITE-R). This method consists of four procedures, which include an inner loop and an outer loop. First, the training samples (S0) are randomly split into new training samples (S1) and test samples (T1) using a resampling scheme. Second, S1 is split again into training samples (S2) and test samples (T2) using the same resampling scheme; rule induction methods are applied and predefined metrics are calculated. This second procedure, the inner loop, is repeated a finite number of times, estimated from the inner precision preset by the user. Third, rule induction methods are applied to S1, and the metrics calculated on T1 are compared with those calculated on T2. If the metrics derived from T2 predict those derived from T1, we count it as a success. The second and third procedures, the outer loop, are iterated a finite number of times, estimated from the outer precision preset by the user. Fourth, the overall results are interpreted, and the best method is selected if the resampling scheme performs well. We apply MULT-RECITE-R to three newly collected medical databases and seven UCI databases. The results show that this method selects the best estimation method in almost all cases.
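The four procedures can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the resampling scheme is a simple random hold-out split, the metric is accuracy, the inner and outer iteration counts are fixed numbers rather than values derived from a preset precision, the S1/T1 split is redrawn on every outer iteration, and "T2 predicts T1" is taken to mean that the mean T2 metric lies within a tolerance `tol` of the T1 metric. The names `mult_recite_r`, `fit_majority`, and `fit_threshold`, and the toy data, are hypothetical.

```python
import random
from statistics import mean

def split(samples, ratio, rng):
    """Randomly split samples into (train, test): an assumed hold-out scheme."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]

def accuracy(predict, test):
    """Fraction of (x, y) pairs in test that predict labels correctly."""
    return mean(1.0 if predict(x) == y else 0.0 for x, y in test)

def mult_recite_r(s0, methods, n_outer=20, n_inner=20, ratio=0.7, tol=0.1, seed=0):
    """Sketch of the nested resampling loops; fixed counts are an assumption."""
    rng = random.Random(seed)
    results = {}
    for name, fit in methods.items():
        successes, t1_scores = 0, []
        for _ in range(n_outer):
            # Procedure 1: split S0 into S1 (train) and T1 (test).
            s1, t1 = split(s0, ratio, rng)
            # Procedure 2 (inner loop): repeatedly split S1 into S2 and T2.
            t2_scores = []
            for _ in range(n_inner):
                s2, t2 = split(s1, ratio, rng)
                t2_scores.append(accuracy(fit(s2), t2))
            # Procedure 3: train on S1, evaluate on T1, compare with T2.
            t1_score = accuracy(fit(s1), t1)
            t1_scores.append(t1_score)
            if abs(mean(t2_scores) - t1_score) <= tol:  # assumed success test
                successes += 1
        # Procedure 4: summarize per method for interpretation.
        results[name] = {
            "success_rate": successes / n_outer,
            "mean_t1_metric": mean(t1_scores),
        }
    return results

def fit_majority(train):
    """Toy baseline: always predict the most frequent training label."""
    labels = [y for _, y in train]
    top = max(set(labels), key=labels.count)
    return lambda x: top

def fit_threshold(train):
    """Toy one-rule learner: cut at the midpoint between the class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    if not pos or not neg:
        return fit_majority(train)
    cut = (mean(pos) + mean(neg)) / 2
    return lambda x: 1 if x > cut else 0

# Synthetic toy data, not taken from the paper: label is 1 when x >= 50.
data = [(x, 1 if x >= 50 else 0) for x in range(100)]
results = mult_recite_r(data, {"majority": fit_majority, "threshold": fit_threshold})
for name, summary in results.items():
    print(name, summary)
```

In this sketch the success rate indicates how well the resampling scheme anticipates held-out performance for each method, and the mean T1 metric is what would be compared across methods once that rate is judged acceptable. How both quantities are combined into a final selection is left open here, since the abstract does not specify it.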