By John Mount
As we demonstrated in “A gentle introduction to parallel computing in R”, one of the great things about R is how easy it is to take advantage of parallel processing capabilities to speed up calculation. In this note we will show how to move from running jobs on multiple CPUs/cores to running jobs on multiple machines (for even larger scaling and greater speedup). Using the technique on Amazon EC2 even turns your credit card into a supercomputer.
Colossus supercomputer : The Forbin Project
R itself is not a language designed for parallel computing. It doesn’t have a lot of great user-exposed parallel constructs. What saves us is that the data science tasks we tend to use R for are themselves very well suited to parallel programming, and many people have prepared very good pragmatic libraries to exploit this. There are three main ways for a user to benefit from library-supplied parallelism:
- Link against superior and parallel libraries such as the Intel BLAS library (supplied on Linux, OSX, and Windows as part of the Microsoft R Open distribution of R). This replaces libraries you are already using with parallel ones, and you get a speedup for free (on appropriate tasks, such as the linear algebra portions of lm()/glm()).
- Ship your modeling tasks out of R into an external parallel system for processing. This is the strategy of systems such as rx methods from RevoScaleR, now Microsoft Open R, h2o methods from h2o.ai, or RHadoop.
- Use R’s parallel facility to ship jobs to cooperating R instances. This is the strategy used in “A gentle introduction to parallel computing in R” and many libraries that sit on top of parallel. This is essentially implementing remote procedure call through sockets or networking.
We are going to write more about the third technique.
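As a minimal sketch of what shipping jobs to cooperating R instances can look like, the snippet below starts a small socket cluster with the parallel package and applies a function across it. The worker count of 2 and the task function f are assumptions chosen for illustration; they are not taken from the original article.

```r
library(parallel)

# Start cooperating R worker processes on the local machine.
# Assumption: 2 local workers. makeCluster() defaults to a socket ("PSOCK")
# cluster; to use several machines, the first argument could instead be a
# character vector of host names that this machine can reach.
cl <- makeCluster(2)

# Hypothetical task for illustration: square each input value.
f <- function(x) x^2

# Ship the function and the inputs to the workers and collect the results.
res <- parLapply(cl, 1:4, f)

# Always shut the workers down when finished.
stopCluster(cl)

print(unlist(res))
```

The calling code looks the same whether the workers live on one machine or many; only the cluster specification passed to makeCluster() would change.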
Source:: r-bloggers.com