
Using truncated SVD to reduce dimensionality


Truncated Singular Value Decomposition (SVD) is a matrix factorization technique that factors a matrix M into the three matrices U, Σ, and V. It is very similar to PCA, except that the factorization for SVD is done on the data matrix, whereas for PCA the factorization is done on the covariance matrix. Typically, SVD is used under the hood to find the principal components of a matrix.
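To make that PCA connection concrete, here is a minimal sketch of my own (not part of the original recipe) showing that PCA's principal components are the right singular vectors of the centered data matrix; the comparison uses absolute values because singular-vector signs are not unique:

import numpy as np
from scipy.linalg import svd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
Xc = X - X.mean(axis=0)                      # PCA centers the data first

U, S, Vt = svd(Xc, full_matrices=False)      # SVD of the centered data matrix
pca = PCA(n_components=2).fit(X)

# PCA's components match the top right singular vectors, up to sign
print(np.allclose(np.abs(pca.components_), np.abs(Vt[:2])))  # expected: True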

Getting ready

Truncated SVD is different from a regular SVD in that it produces a factorization where the number of columns is equal to the specified truncation. For example, given an n x n matrix, a full SVD will produce matrices with n columns, whereas truncated SVD will produce matrices with the specified number of columns. This is how the dimensionality is reduced.
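As a quick illustration of the column counts (my own sketch on a random matrix, not from the original text):

import numpy as np
from scipy.linalg import svd
from sklearn.decomposition import TruncatedSVD

X = np.random.RandomState(0).rand(6, 6)            # a 6 x 6 matrix
U, S, Vt = svd(X)                                   # full SVD
print(U.shape)                                      # (6, 6): n columns
print(TruncatedSVD(2).fit_transform(X).shape)       # (6, 2): only the 2 requested columns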

Here, we'll again use the iris dataset so that you can compare this outcome against the PCA outcome:

from sklearn.datasets import load_iris
iris = load_iris()
iris_data = iris.data
iris_target = iris.target

How to do it...

This object follows the same form as the other objects we've used. First, we'll import the required object, then we'll fit the model and examine the results:

from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(2)
iris_transformed = svd.fit_transform(iris_data)
iris_data[:5]
array([[ 5.1, 3.5, 1.4, 0.2],
       [ 4.9, 3. , 1.4, 0.2],
       [ 4.7, 3.2, 1.3, 0.2],
       [ 4.6, 3.1, 1.5, 0.2],
       [ 5. , 3.6, 1.4, 0.2]])
iris_transformed[:5]
array([[ 5.91220352, -2.30344211],
       [ 5.57207573, -1.97383104],
       [ 5.4464847 , -2.09653267],
       [ 5.43601924, -1.87168085],
       [ 5.87506555, -2.32934799]])

The transformed output, shown above, now has two columns instead of the original four.

How it works...

Now that we've walked through how TruncatedSVD is performed in scikit-learn, let's look at how we can use only scipy, and learn a bit in the process. First, we need to use scipy's linalg to perform SVD:

from scipy.linalg import svd
import numpy as np
D = np.array([[1, 2], [1, 3], [1, 4]])
D
array([[1, 2],
       [1, 3],
       [1, 4]])
U, S, V = svd(D, full_matrices=False)
U.shape, S.shape, V.shape
((3, 2), (2,), (2, 2))

We can reconstruct the original matrix D to confirm U, S, and V as a decomposition:

np.dot(U.dot(np.diag(S)), V)  # np.diag(S) builds a diagonal matrix from the singular values
array([[1, 2],
       [1, 3],
       [1, 4]])

The matrix that is actually returned by TruncatedSVD is the dot product of the U and S matrices.
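To check that claim empirically, here is a hedged sketch of my own (not from the original text); the comparison uses absolute values because the two routines may differ in the sign of each column:

import numpy as np
from scipy.linalg import svd
from sklearn.datasets import load_iris
from sklearn.decomposition import TruncatedSVD

iris_data = load_iris().data

sk_out = TruncatedSVD(2).fit_transform(iris_data)     # scikit-learn's result

U, S, V = svd(iris_data, full_matrices=False)          # full SVD via scipy
scipy_out = U[:, :2] * S[:2]                           # U times diag(S), truncated to 2 columns

# identical up to the sign of each column
print(np.allclose(np.abs(sk_out), np.abs(scipy_out)))  # expected: True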

If we want to simulate the truncation, we will drop the smallest singular values and the corresponding column vectors of U. So, if we want a single component here, we do the following:

new_S = S[0]
new_U = U[:, 0]
new_U.dot(new_S)
array([-2.20719466, -3.16170819, -4.11622173])

In general, if we want to truncate to some dimensionality, for example t, we drop the N - t smallest singular values.
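As a small self-contained sketch of that rule (my own illustration, reusing the same D as above):

import numpy as np
from scipy.linalg import svd

D = np.array([[1, 2], [1, 3], [1, 4]])
U, S, V = svd(D, full_matrices=False)

t = 1                                    # target dimensionality
truncated = U[:, :t] * S[:t]             # keep the t largest singular values, drop N - t = 1
print(truncated)                         # the single-component result from above, as a column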

There's more...

TruncatedSVD has a few miscellaneous details that are worth noting with respect to the method.

Sign flipping

There's a "gotcha" with truncated SVDs. Depending on the state of the random number generator, successive fittings of TruncatedSVD can flip the signs of the output. To avoid this, it's advisable to fit TruncatedSVD once and then use transforms from then on.

Another good reason for Pipelines!

To carry this out, do the following:

tsvd = TruncatedSVD(2)
tsvd.fit(iris_data)            # fit once
tsvd.transform(iris_data)      # then only transform, so the sign convention stays fixed
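As the heading suggests, a Pipeline is one convenient way to enforce this fit-once-then-transform pattern; here is a hedged sketch of that idea (my own, not from the original recipe):

from sklearn.datasets import load_iris
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline

iris_data = load_iris().data

pipe = Pipeline([("tsvd", TruncatedSVD(2))])
pipe.fit(iris_data)                       # the only fit
reduced = pipe.transform(iris_data)       # later calls reuse the same fitted components
print(reduced[:5])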

Sparse matrices

One advantage of TruncatedSVD over PCA is that TruncatedSVD can operate on sparse matrices while PCA cannot. This is due to the fact that the covariance matrix must be computed for PCA, which requires operating on the entire matrix.
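Here is a minimal sketch of that advantage (my own illustration; iris is not actually sparse, it is only wrapped in a sparse container for demonstration):

from scipy.sparse import csr_matrix
from sklearn.datasets import load_iris
from sklearn.decomposition import TruncatedSVD

sparse_iris = csr_matrix(load_iris().data)   # wrap the dense data in a sparse matrix

tsvd = TruncatedSVD(2)
print(tsvd.fit_transform(sparse_iris)[:5])   # works directly on the sparse input
# PCA(2).fit_transform(sparse_iris) has historically raised an error asking for dense data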
