Dimensionality reduction
Dimensionality reduction is one of the main applications of unsupervised learning. It can be understood as the process of reducing the number of random variables under consideration by obtaining a set of principal variables.[1] High dimensionality carries many costs: redundant and irrelevant features that degrade the performance of some algorithms, difficulty in interpretation and visualization, and infeasible computation.[2]
Components
Dimensionality reduction can be divided into two components or subcategories[3]:
- Feature selection: consists of finding a subset of the original variables that is best suited to modeling the problem. It usually involves one of three approaches[4]:
- Wrappers
- Filters
- Embedded
- Feature extraction: transforms the data from a high-dimensional space into a space of fewer dimensions by deriving new variables from the original ones[4]; a minimal sketch contrasting the two components follows this list.
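The sketch below contrasts the two components, assuming scikit-learn and its bundled iris dataset are available; SelectKBest with a univariate F-test stands in for a filter-style selector, and PCA stands in for a feature extractor. The choice of two components/features is purely illustrative.

```python
# Minimal sketch: feature selection vs. feature extraction (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)  # 4 original features

# Feature selection (filter style): keep 2 of the original columns,
# ranked by a univariate ANOVA F-test against the target.
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Feature extraction: build 2 new variables as linear combinations
# of all original columns.
X_extracted = PCA(n_components=2).fit_transform(X)

print(X.shape, X_selected.shape, X_extracted.shape)  # (150, 4) (150, 2) (150, 2)
```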
Algorithms
Some of the most common dimensionality reduction algorithms in machine learning are the following[1]; a short usage sketch appears after the list:
- Principal component analysis (PCA)
- Kernel principal component analysis (kernel PCA)
- Locally linear embedding (LLE)
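A minimal usage sketch of these three algorithms, assuming scikit-learn and its make_swiss_roll toy dataset; the number of components, the RBF kernel, the gamma value, and the neighbor count are illustrative choices, not recommendations.

```python
# Minimal sketch: three common dimensionality reduction algorithms (assumes scikit-learn).
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)  # 3-D toy manifold

X_pca = PCA(n_components=2).fit_transform(X)
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04).fit_transform(X)
X_lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10).fit_transform(X)

print(X_pca.shape, X_kpca.shape, X_lle.shape)  # each (1000, 2)
```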
Methods
Some common methods to perform dimensionality reduction are listed as follows[4]; a sketch of a few of them appears after the list:
- Missing values: drop variables that have a high proportion of missing entries.
- Low variance: drop variables whose variance falls below a chosen threshold, since near-constant variables carry little information.
- Decision trees: use a tree's feature importances to keep only the variables the tree actually splits on.
- Random forest: average feature importances over many trees and discard variables with low importance.
- High correlation: when two variables are strongly correlated, drop one of them, as it adds little new information.
- Backward feature elimination: repeatedly train a model and remove the least significant variable until performance degrades.
- Factor analysis: group correlated variables and represent each group by an underlying latent factor.
- Principal component analysis (PCA): project the data onto orthogonal components ordered by the variance they explain, keeping only the first few.
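A rough sketch of three of these filters (missing values, low variance, high correlation) on synthetic data, assuming pandas and NumPy; the thresholds used (50% missing, variance 1e-3, correlation 0.95) are arbitrary illustrative values, not prescriptions.

```python
# Minimal sketch: missing-value, low-variance, and high-correlation filters (assumes pandas, NumPy).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.normal(size=100),
    "b": rng.normal(size=100),
    "c": np.ones(100),                                          # near-constant column
})
df["d"] = df["a"] * 0.99 + rng.normal(scale=0.01, size=100)     # nearly a copy of "a"
df.loc[rng.choice(100, size=60, replace=False), "b"] = np.nan   # mostly missing column

# Missing values: drop columns with more than 50% missing entries.
df = df.loc[:, df.isna().mean() <= 0.5]

# Low variance: drop columns whose variance is below a small threshold.
df = df.loc[:, df.var() > 1e-3]

# High correlation: for any pair correlated above 0.95, drop one of the two columns.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
df = df.drop(columns=[c for c in upper.columns if (upper[c] > 0.95).any()])

print(df.columns.tolist())  # only the informative, non-redundant columns remain
```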
Disadvantages
One of the main disadvantages of dimensionality reduction is the loss of some amount of information from the original data. In the case of PCA, it only captures linear correlations between variables, which is sometimes undesirable.[4]
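A small illustration of this information loss, assuming scikit-learn and its bundled digits dataset: the explained variance ratio reports how much of the original variance the retained components preserve, and the remainder is what the reduction discards. The choice of 10 components is illustrative.

```python
# Minimal sketch: quantifying the information retained by a PCA reduction (assumes scikit-learn).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 64 original features

pca = PCA(n_components=10).fit(X)
# Fraction of the original variance retained by the 10 components;
# the rest is lost by the reduction.
retained = pca.explained_variance_ratio_.sum()
print(f"variance retained: {retained:.2%}")
```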