by Seth Mottaghinejad, Data Scientist at Microsoft
R and big data
There are many R packages dedicated to letting users (or useRs, if you prefer) deal with big data in R. (We will intentionally avoid using proper case for ‘big data’, because (1) the term has become somewhat hackneyed, and (2) for the sake of this article we can think of big data as any dataset too large to fit into memory as a data.frame that standard R functions can operate on.) Even without third-party packages, base R puts some tools at our disposal, which boil down to doing one of two things: we can either format the data more economically so that it can still be squeezed into memory, or we can deal with the data piecemeal, bypassing the need to load it into memory all at once.
An example of the first approach is to format character vectors as factors, when doing so is appropriate, because a factor is stored as an integer vector under the hood, which takes less space than the strings it represents. There are of course many other advantages to using factors, but let’s not digress. An example of the second approach is to process the data only a certain number of rows at a time, i.e. chunk by chunk, where each chunk fits into memory and can be brought into R as a data.frame.
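As a minimal sketch of these two ideas (the file name "flights.csv" and the carrier codes are hypothetical stand-ins, not data from this post), the first snippet compares the memory footprint of a character vector with its factor equivalent, and the second reads a CSV file a fixed number of rows at a time by passing an open connection to base R’s read.csv:

```r
## Approach 1: a factor stores repeated strings as integer codes plus a levels table
carriers_chr <- rep(c("AA", "DL", "UA", "WN"), times = 250000)  # hypothetical values
carriers_fct <- factor(carriers_chr)
print(object.size(carriers_chr), units = "MB")  # character vector
print(object.size(carriers_fct), units = "MB")  # factor, roughly half the size here

## Approach 2: process the file chunk by chunk instead of loading it all at once
chunk_size <- 100000
con <- file("flights.csv", open = "r")           # hypothetical file name
first_chunk <- read.csv(con, nrows = chunk_size) # header plus first chunk
col_names <- names(first_chunk)
n_rows <- nrow(first_chunk)                      # stand-in for real per-chunk work
repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = chunk_size, header = FALSE, col.names = col_names),
    error = function(e) NULL                     # read.csv errors when no lines remain
  )
  if (is.null(chunk)) break
  n_rows <- n_rows + nrow(chunk)                 # replace with any per-chunk computation
}
close(con)
n_rows
```

Each iteration only ever holds one chunk-sized data.frame in memory, which is the essence of the second approach.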
RevoScaleR and big data
The aforementioned chunk-wise processing of data is what the RevoScaleR package does behind the scenes. For example, with the rxLinMod function (the counterpart to base R’s lm function) we can run a regression model on a very large dataset, one that is presumably too large to fit into memory. Even if the dataset could still fit into memory (servers nowadays can easily have 500GB of RAM), processing it using …read more
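To make the comparison with lm concrete, here is a minimal sketch of rxLinMod, assuming the RevoScaleR package is available (it ships with Microsoft R Client / Machine Learning Server rather than CRAN); the XDF file name and column names below are hypothetical stand-ins in the style of the airline data used in the package’s own examples:

```r
## A sketch, assuming RevoScaleR is installed (Microsoft R Client / ML Server)
library(RevoScaleR)

## An XDF file is RevoScaleR's on-disk, chunked data format; rxLinMod streams
## through it one chunk at a time instead of loading it all into memory
airline_xdf <- RxXdfData("airline.xdf")  # hypothetical file name

## Counterpart to lm(): same formula interface, chunk-wise computation
fit <- rxLinMod(ArrDelay ~ DepDelay + DayOfWeek, data = airline_xdf)
summary(fit)
```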
Source: r-bloggers.com