Training a Genre Classifier for Automatic Classification of Web Pages

Vedrana Vidulin; Mitja Luštrek; Matjaž Gams

doi:10.2498/cit.1001137

Abstract

This paper presents experiments on classifying web pages by genre. Firstly, a corpus of 1539 manually labeled web pages was prepared. Secondly, 502 genre features were selected based on the literature and the observation of the corpus. Thirdly, these features were extracted from the corpus to obtain a data set. Finally, two machine learning algorithms, one for induction of decision trees (J48) and one ensemble algorithm (bagging), were trained and tested on the data set. The ensemble algorithm achieved on average 17% better precision and 1.6% better accuracy, but slightly worse recall; F-measure did not vary significantly. The results indicate that classification by genre could be a useful addition to search engines.
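The abstract compares the two algorithms on precision, recall, accuracy, and F-measure without defining them. The sketch below is a minimal illustration, assuming the standard definitions computed per genre from binary confusion counts (a page either carries the genre label or does not). The `Counts` class, the function names, and the example counts are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion counts for one genre (hypothetical container, not from the paper)."""
    tp: int  # pages correctly assigned to the genre
    fp: int  # pages wrongly assigned to the genre
    fn: int  # pages of the genre that were missed
    tn: int  # pages correctly left out of the genre


def precision(c: Counts) -> float:
    # Share of pages assigned to the genre that really belong to it.
    assigned = c.tp + c.fp
    return c.tp / assigned if assigned else 0.0


def recall(c: Counts) -> float:
    # Share of pages of the genre that the classifier found.
    relevant = c.tp + c.fn
    return c.tp / relevant if relevant else 0.0


def accuracy(c: Counts) -> float:
    # Share of all pages that were classified correctly.
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


def f_measure(c: Counts) -> float:
    # Harmonic mean of precision and recall (balanced F-measure assumed).
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Made-up counts for a single genre, for illustration only.
example = Counts(tp=40, fp=10, fn=20, tn=930)
print(precision(example), recall(example), accuracy(example), f_measure(example))
```

Because the F-measure is the harmonic mean of precision and recall, a gain in precision paired with a loss in recall can leave it nearly unchanged, which is consistent with the pattern the abstract reports.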

DOI: https://doi.org/10.2498/cit.1001137

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.

Journal of Computing and Information Technology
