by Hong Ooi, Sr. Data Scientist, Microsoft
I’m pleased to announce the release of version 0.62 of the dplyrXdf package, a backend to dplyr that allows the use of pipeline syntax with Microsoft R Server’s Xdf files. This update adds a new verb (persist), fills some holes in support for dplyr verbs, and fixes various bugs.
The persist verb
A side-effect of dplyrXdf handling file management is that passing the output from one pipeline into subsequent pipelines can have unexpected results. Consider the following example:
# pipeline 1
output1 <- flightsXdf %>%
    mutate(delay=(arr_delay + dep_delay)/2)
# use the output from pipeline 1
output2 <- output1 %>%
    group_by(carrier) %>%
    summarise(delay=mean(delay))
# reuse the output from pipeline 1 -- WRONG
output3 <- output1 %>%
    group_by(dest) %>%
    summarise(delay=mean(delay))
The problem with this code is that the second pipeline will overwrite or delete its input (the file behind output1), so the third pipeline will fail. This is consistent with dplyrXdf’s philosophy of only saving the most recent output of a pipeline, where a pipeline is defined as all operations starting from a raw xdf file. In this case, however, that behaviour is not what you want.
Similarly, dplyrXdf stores its output files in R’s temporary directory, so when you close your R session, these files will be deleted. This saves you having to manually delete files that are no longer in use, but it means that you must copy the output of your pipeline to a permanent location if you want to keep it around.
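To make the temporary-directory point concrete, here is a minimal sketch in base R of copying a file out of the session's temporary directory so that it survives the end of the session. It uses an ordinary CSV file rather than an Xdf file and does not call dplyrXdf at all; the file name kept_output.csv and the sample data are hypothetical, chosen only for illustration.

```r
# Toy illustration with an ordinary CSV file, NOT an Xdf file or dplyrXdf itself.
# Write some intermediate output into the R session's temporary directory.
tmp_out <- tempfile(fileext = ".csv")
write.csv(data.frame(carrier = c("AA", "UA"), delay = c(5.5, 7.25)),
          tmp_out, row.names = FALSE)

# The file lives under tempdir(), which R removes when the session ends.
in_tempdir <- normalizePath(dirname(tmp_out)) == normalizePath(tempdir())

# Copy it somewhere outside tempdir() to keep it.
# "kept_output.csv" is a hypothetical name used only for this sketch.
permanent <- file.path(getwd(), "kept_output.csv")
copied <- file.copy(tmp_out, permanent, overwrite = TRUE)

# The copy can be read back independently of the temporary file.
kept <- read.csv(permanent)
```

The same idea applies to any file-backed output: whatever sits under tempdir() is gone after the session, so anything worth keeping has to be written or copied elsewhere first.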
The new persist verb is meant to address these issues. It saves a pipeline’s output to a permanent location and also resets the status of the pipeline, so that subsequent operations will know not to overwrite the data.
# pipeline 1 -- use persist to save the data to the working directory
output1 ...
Source: http://revolutionanalytics.com