By emaasit
Introduction
The objective of this blog post is to demonstrate how to use Apache SparkR to power Shiny applications. I have been curious about what the use cases for a “Shiny-SparkR” application would be, and how to develop and deploy such an app.
SparkR is an R package that provides a lightweight frontend for using Apache Spark from R. SparkR provides a distributed data frame implementation that supports operations like selection, filtering, and aggregation (similar to R data frames and dplyr) but on large datasets. SparkR also supports distributed machine learning using MLlib.
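As a minimal sketch of what that looks like in practice (this example assumes the SparkR 1.x API, where a Spark context is created with sparkR.init; it is illustrative, not taken from the original post):

```r
library(SparkR)

# Initialize a local Spark context and SQL context (SparkR 1.x API)
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

# Promote a local R data frame (the built-in faithful dataset) to a
# distributed SparkR DataFrame
df <- createDataFrame(sqlContext, faithful)

# Selection and filtering execute on the Spark cluster,
# not in local R memory
longWaits <- filter(df, df$waiting > 70)
head(select(longWaits, longWaits$eruptions))

sparkR.stop()
```

The point is that the code reads much like base R or dplyr, while the actual work is distributed across the cluster.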
Shiny is an open source R package that provides an elegant and powerful web framework for building web applications using R. Shiny helps you turn your analyses into interactive web applications without requiring HTML, CSS, or JavaScript knowledge.
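For readers who have not seen one, a complete Shiny app is just a UI definition plus a server function. This minimal example (again using the built-in faithful dataset, and not taken from the original post) is the standard starting point:

```r
library(shiny)

# UI: a slider input and a plot output
ui <- fluidPage(
  titlePanel("Old Faithful eruptions"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 30),
  plotOutput("histogram")
)

# Server: re-renders the histogram whenever the slider changes
server <- function(input, output) {
  output$histogram <- renderPlot({
    hist(faithful$waiting, breaks = input$bins,
         main = "Waiting time between eruptions", xlab = "Minutes")
  })
}

shinyApp(ui = ui, server = server)
```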
Use Cases
So you’re probably asking yourself, “Why would I need to use SparkR to run my Shiny applications?” That is a legitimate question, and to answer it we need to understand the different classes of big data problems.
Classes of Big Data Problems
In a recent AMA on Reddit, Hadley Wickham (Chief Scientist at RStudio) painted a clearer picture of how “Big Data” should be defined. His insights will help us define use cases for SparkR and Shiny.
I believe big data problems should be categorized into three main classes:
- Big Data-Small Analytics: This is where a data scientist begins with a raw big dataset and then slices and dices that data to obtain the right sample required to answer a specific business or research problem. In most cases the resulting sample is a small dataset, which does not require SparkR to run a Shiny application.
- Partition-Aggregate Analytics: This is where a data scientist needs to distribute and parallelize computation over multiple machines. Wickham defines this as a trivially parallelisable problem (a sketch of the general shape of such a computation appears after this list). An example is when you need …
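To make the partition-aggregate idea concrete, here is a hedged sketch in SparkR 1.x. The flights dataset, its HDFS path, and the carrier and arr_delay columns are hypothetical stand-ins, not from the original post:

```r
library(SparkR)

sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

# Hypothetical large dataset; the path and column names are stand-ins
flights <- read.df(sqlContext, "hdfs:///data/flights.parquet",
                   source = "parquet")

# Each Spark partition computes partial group averages in parallel,
# and Spark combines them into one small summary table
delays <- agg(groupBy(flights, flights$carrier),
              avg_delay = avg(flights$arr_delay))

# The aggregated result is small enough to pull back into local R memory
collect(delays)
```

This is exactly the pattern where SparkR earns its keep behind a Shiny app: the heavy aggregation runs on the cluster, and only a small result is collected for display.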
Source: r-bloggers.com