The recent technological developments in areas such as the internet, DNA microarrays, hyperspectral imagery, and databases have resulted in the emergence of large amounts of data across a wide spectrum of applications. These include search engines, proteomics, genomics, text categorization, and information retrieval. According to Tan (2007, p. 17), the total amount of stored information is estimated to double every twenty months. Machine learning offers tools to alleviate this problem by analyzing large quantities of data automatically. However, the large number of features or attributes makes it hard for machine learning to extract all the useful information from such gigantic data streams. Guyon (2003, p. 42) indicates that machine learning provides the necessary tools through which large quantities of data may be automatically analyzed. Feature selection, which selects a subset of the most salient features and removes redundant, noisy, and irrelevant ones, is the process usually employed in machine learning to solve high-dimensional problems.
Feature selection focuses the learning algorithm on the most useful and crucial aspects of the data, making the learning task more accurate as well as faster. In the last five years, feature and variable selection have become one of the main areas of focus in research, especially in applications for which datasets with hundreds of thousands of variables are available. The objective of feature selection is mainly threefold: offering more cost-effective and faster predictors, improving the prediction performance of the predictors, and offering an improved understanding of the underlying processes that generated the data. The contributions of feature and variable selection cover a wide range of aspects of the associated problems, such as feature ranking, feature construction, multivariate feature selection, feature validity assessment, and improved definitions of the objective functions. As indicated above, with the rapid development of applications and novel technologies producing large and increasingly complex accumulations of data at unprecedented speeds, large and previously unknown quantities of candidate features have been collected to represent the data.
Tan (2007, p. 18) indicates that although irrelevant features may add nothing to the learning of the target concept, they, like redundant features, raise the computation costs of the overall learning process. There are four main advantages of using feature selection in supervised machine learning. First, feature selection reduces the computational complexity of the learning and prediction algorithms: most well-known learning algorithms become computationally intractable in the presence of huge numbers of features, both in the training step and in the prediction step, and a preceding feature selection step can greatly alleviate this problem. Second, feature selection saves on the cost of measuring non-selected features: once a smaller set of features that allows good prediction of the labels is established, the other features no longer need to be measured. Third, feature selection can improve accuracy: in most circumstances it enhances prediction accuracy by improving the signal-to-noise ratio. Finally, in many real-world tasks the dimensionality of the data is so high that learning is practically prohibitive or computationally costly; most traditional learning algorithms fail to scale to large problems as a result of the curse of dimensionality, and the existence of noisy features further degrades their performance. Feature selection techniques are thus crucial for solving these problems in machine learning. There are two types of machine learning, supervised and unsupervised, as indicated in the diagram below.
In supervised learning, the class labels of the training data are known. Each training example is represented as a pair of an input object and its desired output, such as a class label. The main task of supervised learners is to establish a function that approximates the mapping between the training data and their classes, in order to predict the classes of new data. Some of the algorithms and approaches proposed for supervised learning include k-nearest neighbors, decision trees, naïve Bayes classifiers, SVMs (Support Vector Machines), and random forests. According to Guyon (2003, p. 40), supervised learning can be defined as the machine learning task of inferring a function from labeled (supervised) training data. The training data consist of a set of training examples, where every example is a pair consisting of an input object, typically a vector, and an output value, also known as the supervisory signal. A supervised learning algorithm analyzes the training data and produces an inferred function, which is called a classifier if the output is discrete, or a regression function if the output is continuous.
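As a concrete illustration of one such supervised learner, the following is a minimal sketch of a k-nearest-neighbor classifier in plain Python. The function name, toy dataset, and choice of Euclidean distance are illustrative assumptions, not taken from the cited sources:

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training
    examples, using Euclidean distance."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy training set: two well-separated classes in 2-D (illustrative data).
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(X, y, (0.3, 0.1)))  # query near the "a" cluster -> "a"
```

Note that k-NN simply stores the labeled training pairs and defers all work to prediction time, which is why the cost of prediction grows with both the number of examples and the number of features.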
On the other hand, unsupervised machine learning can be distinguished from supervised machine learning by the fact that class labels for the training data are not available. Unsupervised learning methods instead decide which objects should be grouped together in one class; put in other words, they learn the classes by themselves. Self-organizing maps and data clustering algorithms, such as fuzzy c-means clustering and K-means clustering, are mostly used for unsupervised learning tasks. It is crucial to note that a good representation of the input objects is essential, because the accuracy of the learned model depends strongly on how the input objects are represented; typically, each input object is transformed into a vector of attributes or features used to describe the object.
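As an illustrative sketch of an unsupervised learner, the following is a minimal K-means (Lloyd's algorithm) in plain Python on a toy 2-D dataset; the function name, seed, and data are assumptions for illustration only:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: repeatedly assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Two well-separated blobs; no labels are given to the algorithm.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (8.0, 8.1), (8.2, 7.9), (7.9, 8.2)]
centroids, clusters = kmeans(data, k=2)
```

The algorithm recovers the two groups from geometry alone, which is exactly the sense in which unsupervised methods "learn the classes by themselves."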
There are different feature selection algorithms having various objectives to be attained. Some of these objectives include:
- Finding the minimally sized feature subset that is necessary and sufficient for the target concept.
- Selecting a subset of N features from a set of M features, N < M, such that the value of the criterion function is optimized over all subsets of size N.
- Selecting a subset of features to improve prediction accuracy or reduce the size of the structure, without significantly reducing the prediction accuracy of classifiers built using only the selected features.
- Selecting a small subset such that the resulting class distribution is as close as possible to the original class distribution given all feature values.
1.2. Feature Selection Procedure
As indicated above, feature selection, also known as feature reduction, variable selection, variable subset selection, or attribute selection, is a method of selecting a subset of the relevant features for building a robust learning model. Guyon (2003, p. 40) argues that feature selection is critical for problem understanding, since the selected features can offer great insight into the nature of the problem being tackled. This is enormously important, because in many circumstances the ability to point out the informative features matters more than the ability to make good predictions in itself. The main goal of feature selection is to select a small subset of features such that the recognition rate of the classifier does not decrease substantially. Liu and Yu (2005, p. 32) argue that a feature selection method depends on the way in which subsets are generated as well as on the evaluation function employed to evaluate the subsets under examination. There are numerous types of feature selection procedures, based on the generation procedure for the subsets and the evaluation function used to evaluate them. However, a typical feature selection procedure consists of four basic steps, namely:
- Subset generation
- Subset evaluation
- Stopping criterion, and
- Results validation
The flow chart below briefly indicates these four processes.
The process starts with subset generation, which employs a particular search strategy to produce candidate feature subsets. Each candidate subset is then evaluated according to a particular evaluation criterion and compared with the best subset found so far; if it is better, it replaces the previous best. The generation and evaluation of subsets is repeated until a certain stopping criterion is met. Finally, the selected best feature subset is validated using test data or prior knowledge. Liu and Yu (2005, p. 20) argue that the search strategy and the evaluation criterion are the key points when studying feature selection.
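The four-step loop just described can be sketched generically. The plain-Python skeleton below is illustrative only: the generation strategy, evaluation criterion, and stopping criterion are passed in as functions, and the toy relevance-minus-size criterion is an assumption, not from the cited sources:

```python
import random

def feature_selection(generate, evaluate, stop):
    """Generic four-step loop: generate a candidate subset, evaluate it,
    keep it if it beats the best so far, and repeat until the stopping
    criterion fires. Validation of the winner happens afterwards, on
    test data or prior knowledge."""
    best, best_score, iteration = None, float("-inf"), 0
    while not stop(iteration, best_score):
        candidate = generate(best)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
        iteration += 1
    return best, best_score

# Illustrative plug-ins: random subset generation, a toy criterion that
# rewards two "relevant" features and penalizes subset size, and a fixed
# iteration budget as the stopping criterion.
rng = random.Random(0)
features = ["f1", "f2", "f3", "f4"]
generate = lambda _: frozenset(rng.sample(features, rng.randint(1, len(features))))
evaluate = lambda s: len(s & {"f1", "f3"}) - 0.1 * len(s)
best, best_score = feature_selection(generate, evaluate, lambda it, _: it >= 50)
```

Swapping in a different `generate` or `stop` turns the same skeleton into the exhaustive, sequential, or random strategies described in the following subsections.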
1.2.1 Subset Generation
As indicated by Ng (1998, p. 14), subset generation is first of all a heuristic search, with every state in the search space specifying a candidate subset for evaluation. The nature of this process is determined by two basic issues. First, one must decide on the search starting point or points, which in turn influence the search direction. The search may start with an empty set and successively add features (forward), or begin with the full set and successively remove features (backward). The search can also begin at both ends and add and remove features simultaneously (bi-directional). Further, the search can begin with a randomly selected subset, which helps avoid being trapped in local optima (Liu and Yu, 2005, p. 16). Secondly, one must decide on a search strategy. For a data set with N features, there exist 2^N candidate subsets, so the search space is prohibitively large for exhaustive search, even where N is moderate. Other search strategies have therefore been explored, including complete, sequential, and random search.
1.2.1.1 Exhaustive Search
This method guarantees to find the optimal result as dictated by the evaluation criterion adopted. In other words, exhaustive search is complete, as no optimal subset is missed. For instance, for a data set having five parameters, X1, …, X5, there is a number of possible combinations of the independent variables, as indicated in the table below.
(Rogati and Yang, 2002, p. 44)
Note: for a given dataset having Y independent parameters, the exhaustive search method has 2^Y - 1 regression models to select from. This type of search is said to be exhaustive because it is guaranteed to generate all reachable states before terminating with failure. Liu and Yu (2005, p. 12) indicate that an exhaustive search builds a regression model for each possible combination of the parameters. However, it is crucial to note that a complete search need not be exhaustive: various heuristic functions may be used to reduce the search space without jeopardizing the chances of finding the optimal result. Thus, although the order of the search space is O(2^N), a smaller number of subsets can be evaluated. An exhaustive search may be performed on condition that the number of variables is not too large. A wide variety of search strategies, such as branch-and-bound, best-first, genetic algorithms, and simulated annealing, may be used. Notable examples include branch-and-bound and beam search. The branch-and-bound procedure performs an exhaustive search in an orderly fashion, such as a search tree, but halts the search along a given branch if a bound or limit is exceeded, for instance when a sub-solution fails to look promising. In spite of its worst-case time complexity, this search method is often fast for particular problems.
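As a quick illustration of the 2^Y - 1 count, the following plain-Python snippet enumerates every non-empty subset of a five-parameter set (the names are illustrative):

```python
from itertools import combinations

def all_subsets(features):
    """Exhaustive generation: every non-empty subset of the feature set,
    2**len(features) - 1 candidates in total."""
    return [
        set(subset)
        for r in range(1, len(features) + 1)
        for subset in combinations(features, r)
    ]

candidates = all_subsets(["X1", "X2", "X3", "X4", "X5"])
print(len(candidates))  # 2**5 - 1 = 31
```

Each of these 31 candidates would be fitted and scored by the evaluation criterion, which is why the approach only remains feasible while the number of variables is small.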
1.2.1.2 Sequential Search
This method gives up completeness and therefore risks losing optimal subsets. There are various variations of the greedy hill-climbing method, including sequential backward elimination, sequential forward selection, and bi-directional selection. Sequential forward selection begins with an empty feature subset. At each iteration, exactly one feature is added to the subset: the algorithm adds to the current candidate subset each feature not already selected, in turn, and tests the accuracy of a classifier built on the tentative subset, keeping the best addition. Sequential backward elimination is similar to forward selection, but begins with all the features and tries to remove the feature whose removal yields the highest accuracy gain. In some instances, a greedier version of the sequential search method can be implemented, in which more than one feature may be added or deleted at every iteration. In supervised machine learning, the greedier version is faster because it examines fewer candidate feature subsets. These methods add or remove a single feature at a time; another variant is to add p features in a single step and remove q features in the following step (p > q) (Liu and Yu, 2005, p. 15). Algorithms based on sequential search are simple to implement and fast to produce results, since the order of the sequential search space is O(N^2) or less.
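Sequential forward selection as described above can be sketched in plain Python as follows. The toy evaluation function, which rewards two "relevant" features and penalizes subset size, is an illustrative assumption standing in for classifier accuracy:

```python
def forward_select(features, evaluate):
    """Sequential forward selection: start from the empty set; at each
    iteration add the single unselected feature that most improves the
    evaluation score, and stop when no addition improves it."""
    selected, best_score = set(), float("-inf")
    while True:
        candidates = [
            (evaluate(selected | {f}), f) for f in features if f not in selected
        ]
        if not candidates:
            break
        score, f = max(candidates)
        if score <= best_score:
            break  # no single addition improves the current subset
        selected.add(f)
        best_score = score
    return selected, best_score

# Toy criterion: reward the "relevant" features f1 and f3, penalize size.
evaluate = lambda s: len(s & {"f1", "f3"}) - 0.1 * len(s)
selected, score = forward_select(["f1", "f2", "f3", "f4"], evaluate)
print(selected)  # the two relevant features
```

Backward elimination is the mirror image: start from the full set and greedily remove the feature whose removal helps most.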
1.2.1.3 Random Search
Random search usually starts with a randomly selected subset and then proceeds in one of two distinct ways. One way is to follow a sequential search, injecting randomness into the sequential approach; examples include simulated annealing and random-start hill-climbing. The other way is to generate subsequent subsets in a completely random manner, a method known as the Las Vegas algorithm. According to Tan (2007, p. 21), random search evaluates a large set of random feature subsets and returns the best one. It can also serve as a baseline for more complex selection strategies, such as genetic feature selection, or as the initialization of a greedy search. For instance, bad choices made in the early stages of a greedy search are usually hard to undo, while forward search suffers if features that are judged individually are poor. Generally, for all these approaches, the use of randomness significantly helps in escaping local optima in the search space. Further, the optimality of the selected subsets largely depends on the available resources.
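A Las-Vegas-style random search can be sketched as below in plain Python; the iteration budget, seed, and toy evaluation function are illustrative assumptions:

```python
import random

def random_search(features, evaluate, n_iter=200, seed=0):
    """Las Vegas-style search: draw feature subsets uniformly at random
    and keep the best-scoring subset seen within the iteration budget."""
    rng = random.Random(seed)
    feats = list(features)
    best_subset, best_score = None, float("-inf")
    for _ in range(n_iter):
        subset = frozenset(rng.sample(feats, rng.randint(1, len(feats))))
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score

# Toy criterion: reward features "a" and "c", penalize subset size.
evaluate = lambda s: len(s & {"a", "c"}) - 0.1 * len(s)
best_subset, best_score = random_search(["a", "b", "c", "d"], evaluate)
```

Because each draw is independent of the previous one, the method cannot get stuck in a local optimum; the quality of the answer depends only on how many iterations the resource budget allows.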
1.2.1.4 Individual Search
Liu and Yu (2005, p. 13) argue that the individual search method evaluates every feature separately. Its main advantage is its high speed compared to methods such as exhaustive, random, and sequential search. As a result, individual search is often used for pre-selection of candidate feature subsets from large feature sets. However, it is crucial to note that individually poor features can yield high class separability when used together. In individual search, it is also possible to provide additional test sets using test options.
1.2.2 Subset Evaluation
As indicated above, every newly generated subset needs to be evaluated carefully by a given evaluation criterion. It is crucial to note that the goodness of the selected subset is determined by a particular criterion: an optimal subset selected using one criterion may not be optimal according to another. One paradigm, dubbed the filter, operates independently of any learning algorithm: undesired features are filtered out of the data before learning commences. These algorithms use heuristics based on the general characteristics of the data to evaluate the merits of feature subsets. The other school of thought holds that the bias of the given induction algorithm ought to be taken into account when selecting features. This approach, known as the wrapper method, uses an induction algorithm alongside a statistical re-sampling technique, such as cross-validation, to estimate the final accuracy of feature subsets. Generally, feature selection algorithms can be divided into three broad categories, namely filter, wrapper, and hybrid.
1.2.2.1 Filter Method
Independent of any induction algorithm, the filter method filters out redundant, noisy, or irrelevant features in a pre-processing step, before induction takes place. Liu and Yu (2005, p. 10) argue that the filter method selects a feature set for any learning algorithm to use when learning concepts from the training set; the biases of the feature selection algorithm and of the learning algorithm do not interact. The search usually proceeds until a pre-specified number of features is obtained or some thresholding criterion is met. The key advantage of the filter method over others, such as the wrapper method, is that it is fast and scales to large databases. Unlike wrappers, filters utilize the intrinsic properties of the data to evaluate a feature subset. In general, the features are assessed by their discriminatory or relevance power with regard to the target classes.
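A minimal filter sketch in plain Python: each feature is scored by an intrinsic property of the data, here the absolute Pearson correlation with the target, and kept only if it passes a threshold. The helper names, toy data, and threshold are illustrative assumptions; no learning algorithm is involved:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences
    (assumes neither sequence is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_filter(columns, target, threshold=0.5):
    """Filter step: keep every feature whose absolute correlation with
    the target exceeds the threshold, independent of any learner."""
    return {
        name: pearson(values, target)
        for name, values in columns.items()
        if abs(pearson(values, target)) > threshold
    }

# Toy data: one feature tracks the target perfectly, one is noise.
columns = {"relevant": [2, 4, 6, 8], "noise": [1, -1, 1, -1]}
target = [1, 2, 3, 4]
kept = correlation_filter(columns, target)
print(sorted(kept))  # ['relevant']
```

Because the score is computed once per feature from the data alone, the cost is linear in the number of features, which is what lets filters scale to large databases.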
1.2.2.2 Wrapper Method
In wrapper methods, the performance, such as prediction or classification accuracy, of an induction algorithm is used for feature subset evaluation, as indicated in the figure below.
In its most general formulation, the wrapper method consists of using the prediction performance of a given learning machine to assess the relative usefulness of subsets of variables. When applying the wrapper method, one should define:
- How to search the space of possible variable subsets
- How to assess the prediction performance of the learning machine, in order to guide the search as well as halt it
- Which predictor to employ
Wrappers are often criticized as a "brute force" method requiring massive amounts of computation, though this is not necessarily the case. Greedy search strategies are computationally advantageous as well as robust against overfitting. In forward selection, wrappers estimate the accuracy of adding each unselected feature to the feature subset and choose the best feature to add according to this criterion. The method terminates when the estimated accuracy of adding any feature is less than the estimated accuracy of the feature set already selected. Generally, for every generated feature subset S, the wrapper evaluates its fitness by applying the induction algorithm to the dataset using only the features present in S. Wrappers can find feature subsets with high accuracy, because the features match well with the learning algorithm (Liu and Yu, 2005, p. 12).
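The wrapper evaluation step can be sketched in plain Python as the leave-one-out accuracy of a simple induction algorithm, here a 1-nearest-neighbor classifier, restricted to the candidate feature subset. The toy data, in which column 0 is informative and column 1 is pure noise, is an illustrative assumption:

```python
import math

def loo_accuracy_1nn(X, y, subset):
    """Wrapper-style evaluation: leave-one-out accuracy of a
    1-nearest-neighbour classifier that sees only the feature columns
    listed in `subset`."""
    correct = 0
    for i in range(len(X)):
        q = [X[i][j] for j in subset]
        # nearest neighbour among all other rows, selected columns only
        nn = min(
            (k for k in range(len(X)) if k != i),
            key=lambda k: math.dist(q, [X[k][j] for j in subset]),
        )
        correct += y[nn] == y[i]
    return correct / len(X)

# Toy data: column 0 separates the classes, column 1 is pure noise.
X = [[0.0, 9.0], [0.2, 1.0], [0.1, 5.0], [5.0, 9.1], [5.2, 1.2], [5.1, 4.8]]
y = [0, 0, 0, 1, 1, 1]
print(loo_accuracy_1nn(X, y, [0]))  # informative feature alone: 1.0
print(loo_accuracy_1nn(X, y, [1]))  # noise feature alone: 0.0
```

A search strategy (forward, backward, random) would call this evaluation for each candidate subset S, which is precisely where the wrapper's computational cost comes from.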
1.2.2.3 Hybrid Method
In order to improve classification performance as well as speed up feature selection, it is advisable to build a hybrid model. A hybrid takes advantage of both wrappers and filters by employing an independent criterion and a learning algorithm to measure feature subsets. Filters offer an intelligent guide for the wrappers, for example by reducing the search space, offering a good starting point, or shortening the search path, all of which are crucial in scaling wrappers to larger problems. The hybrid method employs the independent measure to decide which subset is best for any given cardinality (Liu and Yu, 2005, p. 13).
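A hybrid sketch in plain Python: a cheap filter-style score shortlists the top-k features, and a more expensive wrapper-style evaluation then searches exhaustively only among those few survivors. All names, scores, and the toy evaluation are illustrative assumptions:

```python
from itertools import combinations

def hybrid_select(features, cheap_score, expensive_eval, k=3):
    """Hybrid sketch: a filter-style independent score ranks all features
    and keeps only the top k; a wrapper-style (expensive) evaluation then
    searches exhaustively among those few survivors."""
    shortlist = sorted(features, key=cheap_score, reverse=True)[:k]
    best, best_score = None, float("-inf")
    for r in range(1, len(shortlist) + 1):
        for subset in combinations(shortlist, r):
            score = expensive_eval(subset)
            if score > best_score:
                best, best_score = set(subset), score
    return best, best_score

# Toy filter scores per feature, and a toy wrapper criterion that
# rewards f1 and f3 while penalizing subset size.
relevance = {"f1": 0.9, "f2": 0.8, "f3": 0.7, "f4": 0.2, "f5": 0.1}
best, _ = hybrid_select(
    relevance,
    cheap_score=relevance.get,
    expensive_eval=lambda s: len(set(s) & {"f1", "f3"}) - 0.1 * len(s),
)
```

The filter stage cuts the wrapper's search space from 2^5 - 1 = 31 subsets to 2^3 - 1 = 7, which is how the hybrid scales wrappers to larger problems.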
1.2.2.4 Feature Ranking
Among the proposed feature selection algorithms are feature ranking methods, which rank or score features by a particular criterion. Using feature rankings as the basis of a selection mechanism is attractive due to its scalability, simplicity, and good empirical success. Computationally, feature ranking is efficient, since it requires only the computation of M scores and the sorting of those scores. Given the ranks of the features, a subset of the most important features may be selected in order to build classifiers or predictors.
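Feature ranking thus reduces to computing one score per feature and sorting, which can be sketched in plain Python as follows (the scores shown are illustrative):

```python
def rank_features(scores):
    """Feature ranking: one score per feature, sorted best-first; the
    top-m features form the selected subset."""
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative per-feature scores (e.g. from a correlation filter).
scores = {"f1": 0.9, "f2": 0.1, "f3": 0.7}
print(rank_features(scores))      # ['f1', 'f3', 'f2']
print(rank_features(scores)[:2])  # top-2 subset: ['f1', 'f3']
```

The cost is O(M log M) for M features, dominated by the sort, which is why ranking scales to very high-dimensional data.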
1.2.3 Stopping Criterion
It is important to note that a feature selector should be able to decide when to stop searching through the space of feature subsets. Depending on the evaluation strategy, a feature selector may stop adding or removing features when no alternative improves on the merit of the current feature subset. Alternatively, the algorithm may continue to revise the feature subset as long as the merit does not degrade. A further option is to continue generating subsets until reaching the other end of the search space, and then select the best. In most cases, the test cannot measure the distance between the last iterate and the true solution. Generally, a feature selection process may terminate when:
- The search is complete
- A predefined size of the feature subset has been selected
- A predefined number of iterations have been executed
- A sufficiently good or an optimal feature subset has been obtained, or
- The change of the feature subset does not produce a better subset (Tan, 2007, p. 23).
1.2.4 Result Validation
In some applications, the relevant features are known beforehand, allowing one to validate the results of feature selection. However, in many real-world applications it is hard to know which features are relevant. Hence, one has to rely on indirect methods, such as monitoring the change in mining performance as the features change (Liu and Dash, 1997, p. 23).