Principal Components Analysis (PCA)¶
Principal Component Analysis (PCA) is an algorithm for exploratory data analysis and dimensionality reduction. PCA transforms a set of feature vectors of possibly correlated features to a new set of uncorrelated features, called principal components. Principal components are the directions of the largest variance, that is, the directions where the data is mostly spread out.
Operation |
Computational methods |
Programming Interface |
|||
Mathematical formulation¶
Training¶
Given the training set \(X = \{ x_1, \ldots, x_n \}\) of \(p\)-dimensional feature vectors and the number of principal components \(r\), the problem is to compute \(r\) principal directions (\(p\)-dimensional eigenvectors [Lang87]) for the training set. The eigenvectors can be grouped into the \(r \times p\) matrix \(T\) that contains one eigenvector in each row.
Training method: Covariance¶
This method uses eigenvalue decomposition of the covariance matrix to compute the principal components of the datasets. The method relies on the following steps:
Computation of the covariance matrix
Computation of the eigenvectors and eigenvalues
Formation of the matrices storing the results
Covariance matrix computation is performed in the following way:
Compute the vector-column of sums \(s_i = \sum_{j=1}^n x_{i,j}, \quad 1 \leq i \leq p\).
Compute the cross-product \(P = X^TX - s^Ts\).
Compute the covariance matrix \(\Sigma = \frac{1}{n - 1} P\).
To compute eigenvalues \(\lambda_i\) and eigenvectors \(\upsilon_i\), the implementer can choose an arbitrary method such as [Ping14].
The final step is to sort the set of pairs \((\lambda_i, \upsilon_i)\) in the descending order by \(\lambda_i\) and form the resulting matrix \(T = (\upsilon_{i,1}, \cdots, \upsilon_{i,r}), \quad 1 \leq i \leq p\). Additionally, the means and variances of the initial dataset are returned.
Training method: SVD¶
This method uses singular value decomposition of the dataset to compute its principal components. The method relies on the following steps:
Computation of the singular values and singular vectors
Formation of the matrices storing the results
To compute singular values \(\lambda_i\) and singular vectors \(u_i\) and \(v_i\), the implementer can choose an arbitrary method such as [Demmel90].
The final step is to sort the set of pairs \((\lambda_i, v_i)\) in the descending order by \(\lambda_i\) and form the resulting matrix \(T = (v_{i,1}, \cdots, v_{i,r}), \quad 1 \leq i \leq p\). Additionally, the means and variances of the initial dataset are returned.
Sign-flip technique¶
Eigenvectors computed by some eigenvalue solvers are not uniquely defined due to sign ambiguity. To get the deterministic result, a sign-flip technique should be applied. One of the sign-flip techniques proposed in [Bro07] requires the following modification of matrix \(T\):
where \(T_i\) is \(i\)-th row, \(T_{ij}\) is the element in the \(i\)-th row and \(j\)-th column, \(\mathrm{sgn}(\cdot)\) is the signum function,
Inference¶
Given the inference set \(X' = \{ x_1', \ldots, x_m' \}\) of \(p\)-dimensional feature vectors and the \(r \times p\) matrix \(T\) produced at the training stage, the problem is to transform \(X'\) to the set \(X'' = \{ x_1'', \ldots, x_m'' \}\), where \(x_{j}''\) is an \(r\)-dimensional feature vector, \(1 \leq j \leq m\).
The feature vector \(x_{j}''\) is computed through applying linear transformation [Lang87] defined by the matrix \(T\) to the feature vector \(x_{j}'\),
Programming Interface¶
Usage example¶
Training¶
pca::model<> run_training(const table& data) {
const auto pca_desc = pca::descriptor<float>{}
.set_component_count(5)
.set_deterministic(true);
const auto result = train(pca_desc, data);
print_table("means", result.get_means());
print_table("variances", result.get_variances());
print_table("eigenvalues", result.get_eigenvalues());
print_table("eigenvectors", result.get_eigenvectors());
return result.get_model();
}
Inference¶
table run_inference(const pca::model<>& model,
const table& new_data) {
const auto pca_desc = pca::descriptor<float>{}
.set_component_count(model.get_component_count());
const auto result = infer(pca_desc, model, new_data);
print_table("labels", result.get_transformed_data());
}
Examples¶
Batch Processing:
Batch Processing:
Batch Processing: