Dimension reduction is an important problem in the era of big data. The singular value decomposition (SVD) is a classic method for obtaining low-rank approximations of data.
The standard algorithm (which finds all the singular values) is one of the most expensive matrix decomposition algorithms.
Companies like Facebook or Google deal with huge matrices (big data can be big). Often they don't need all the singular values, just the first 10 or 20. They may also not need exquisite precision in those values; good approximations might do just fine.
Fortunately, there are randomized algorithms for computing SVDs that work on a relatively simple logic: approximate the range of the matrix by multiplying it with a handful of random vectors, then work within that much smaller subspace.
The algorithm is fairly simple to implement:

[Figure from Erichson et al., illustrating the randomized SVD algorithm]

In Octave or Matlab, the code can be written in about 10 lines; a NumPy sketch of the same logic is below.
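Here is a minimal NumPy sketch of that range-finder recipe (the function name, oversampling, and power-iteration counts are illustrative choices, not the post's original code):

```python
import numpy as np

def randomized_svd_sketch(A, k, n_oversample=10, n_iter=2):
    """Rank-k truncated SVD via randomized range finding (illustrative sketch)."""
    m, n = A.shape
    # Probe the range of A with a random Gaussian test matrix.
    Omega = np.random.randn(n, k + n_oversample)
    Y = A @ Omega
    # A few power iterations sharpen the range estimate when the
    # singular values decay slowly (QR keeps things numerically stable).
    for _ in range(n_iter):
        Y, _ = np.linalg.qr(A @ (A.T @ Y))
    Q, _ = np.linalg.qr(Y)      # orthonormal basis for the approximate range
    B = Q.T @ A                 # small (k + oversample) x n matrix
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small             # lift left singular vectors back to the original space
    return U[:, :k], s[:k], Vt[:k, :]
```

The expensive full SVD is only ever run on the small projected matrix B, which is where the speedup comes from.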
The resulting truncated SVD can be a surprisingly good approximation, and it can shave multiple orders of magnitude off the computation time (mileage improves as matrices get bigger).
For Python, there are decent implementations of randomized SVD in the sklearn package and in the fbpca package from Facebook. This blog post (http://amedee.me/post/pca-large-matrices/) shows some code to call these routines and provides some benchmarks.
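As a quick, hedged example of calling those routines (the matrix size, rank, and parameter values below are arbitrary; note that fbpca's raw=True skips mean-centering, giving a plain SVD rather than PCA):

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd
# import fbpca  # Facebook's implementation: pip install fbpca

# A tall matrix that is (approximately) rank 20, plus a little noise.
rng = np.random.RandomState(0)
A = rng.randn(2000, 20) @ rng.randn(20, 1000) + 0.01 * rng.randn(2000, 1000)

# scikit-learn: top 20 singular triplets via randomized SVD.
U, s, Vt = randomized_svd(A, n_components=20, n_iter=5, random_state=0)

# fbpca equivalent:
# U, s, Vt = fbpca.pca(A, k=20, raw=True)

# Relative error of the rank-20 reconstruction.
err = np.linalg.norm(A - U @ np.diag(s) @ Vt) / np.linalg.norm(A)
print(f"relative Frobenius error: {err:.3f}")
```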