Web Page Categorisation with Ant Colony Optimization

Description:

The categorisation of textual documents typically involves two main phases: training and classification. In the training phase, documents are examined and the keywords deemed important are extracted. The resulting keyword sets are usually very large, so some form of dimensionality reduction is needed before a useful classifier can be derived. Feature selection is one way of achieving this reduction: it aims to select the input features that are most predictive of a given outcome and to remove unnecessary or misleading attributes.
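
To make the idea of feature selection concrete, the following is a minimal sketch of a simple filter-style selector that ranks keywords by information gain. It is an illustration only and is not the ACO-based method this project investigates. It assumes each document has already been reduced to a set of keywords, and the function names (entropy, information_gain, select_keywords) are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, keyword):
    """Drop in label entropy from knowing whether `keyword` occurs in a document.

    Assumption: `docs` is a list of keyword sets, aligned with `labels`.
    """
    with_kw = [y for d, y in zip(docs, labels) if keyword in d]
    without_kw = [y for d, y in zip(docs, labels) if keyword not in d]
    # Weighted entropy of the two partitions (empty partitions contribute nothing).
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (with_kw, without_kw) if part)
    return entropy(labels) - remainder


def select_keywords(docs, labels, k):
    """Return the k keywords with the highest information gain."""
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(vocabulary,
                    key=lambda w: information_gain(docs, labels, w),
                    reverse=True)
    return ranked[:k]
```

A keyword that appears in every document (or in none) has an information gain of zero under this measure, so it is ranked last; this is the kind of unnecessary attribute that feature selection is meant to discard.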

Swarm Intelligence (SI) is the property of a system whereby the collective behaviours of simple agents interacting locally with their environment cause coherent functional global patterns to emerge (Bonabeau, Dorigo and Theraulaz 1999). For example, ants can find the shortest route between a food source and their nest without using visual information, and can adapt that route when the environment changes.

The aim of this project is to investigate how a feature selection method based on Ant Colony Optimization (ACO) may be applied within a web page classification framework. The resulting system should be able to classify unseen pages into several pre-defined categories (e.g. News, Sport, Computing). Part of this project will involve studying the fundamentals of ACO and text classification, and collecting the data to be used in the training phase.
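
As a starting point, an ACO-style search over feature subsets might be sketched as follows. This is a deliberately simplified illustration, not the method of any of the references below: pheromone is stored per feature rather than per edge, no heuristic desirability term is used, and each ant builds a subset of a randomly chosen size. The function name aco_feature_selection, its parameters, and the evaluate callback (which must return a non-negative quality score for a subset) are all assumptions made for this example.

```python
import random


def aco_feature_selection(features, evaluate, n_ants=10, n_iterations=20,
                          max_size=3, evaporation=0.2, seed=0):
    """Search for a high-scoring feature subset using a simplified ACO loop.

    Assumptions: `evaluate(subset)` returns a non-negative score, and
    pheromone is kept on features (nodes) instead of on edges.
    """
    rng = random.Random(seed)
    pheromone = {f: 1.0 for f in features}
    best_subset, best_score = (), float("-inf")

    for _ in range(n_iterations):
        tours = []
        for _ in range(n_ants):
            # Each ant builds a subset, preferring features with more pheromone.
            target_size = rng.randint(1, max_size)
            candidates = list(features)
            subset = []
            while candidates and len(subset) < target_size:
                weights = [pheromone[f] for f in candidates]
                chosen = rng.choices(candidates, weights=weights, k=1)[0]
                candidates.remove(chosen)
                subset.append(chosen)
            score = evaluate(subset)
            tours.append((subset, score))
            if score > best_score:
                best_subset, best_score = tuple(sorted(subset)), score

        # Evaporation: old trails fade so early choices do not dominate.
        for f in pheromone:
            pheromone[f] *= 1.0 - evaporation
        # Deposit: features in better subsets receive more pheromone.
        for subset, score in tours:
            for f in subset:
                pheromone[f] += score / len(subset)

    return best_subset, best_score
```

In a web page categorisation setting, evaluate could for instance train a classifier on the chosen keywords and return its accuracy on held-out pages, possibly with a penalty for larger subsets; that choice is left open here.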

References:

E. Bonabeau, M. Dorigo, and G. Theraulaz. Swarm Intelligence: From Natural to Artificial Systems. Oxford University Press Inc., New York, NY, USA. 1999.

R. Jensen and Q. Shen. Fuzzy-Rough Data Reduction with Ant Colony Optimization. Fuzzy Sets and Systems, 149(1):5-20. 2005.

Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning About Data. Kluwer Academic Publishers, Dordrecht. 1991.