Tuesday, June 28, 2016 • 11:24am - 11:42am PDT
Distributed Computing using parallel, Distributed R, and SparkR


Data volume is ever increasing, while single-node performance is stagnant. To scale, analysts need to distribute computations. R has built-in support for parallel computing, and third-party contributions, such as Distributed R and SparkR, enable distributed analysis. However, analyzing large data in R remains a challenge, because interfaces to distributed computing environments, like Spark, are low-level and non-idiomatic. The user is effectively coding for the underlying system, instead of writing natural and familiar R code that produces the same result across computing environments. This talk focuses on how to scale R-based analyses across multiple cores and to leverage distributed machine learning frameworks through the ddR (Distributed Data structures in R) package, a convenient, familiar, and idiomatic abstraction that helps to ensure portability and reproducibility of analyses. The ddR package defines a framework for implementing interfaces to distributed environments behind the canonical base R API. We will discuss key programming concepts and demonstrate writing simple machine learning applications. Participants will learn about creating parallel applications from scratch as well as invoking existing parallel implementations of popular algorithms, like random forest and k-means clustering.
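The portability idea described in the abstract — writing ordinary mapply-style R that runs unchanged on different backends — can be sketched with ddR. This is a minimal sketch based on the package's documented core verbs (dmapply and collect); exact function names and defaults may vary by version and are assumptions here:

```r
# Minimal ddR sketch (assumes the ddR package is installed; API names
# follow its documentation and may differ across versions).
library(ddR)

# dmapply() mirrors base R's mapply(): the same expression runs on
# whichever backend is registered (the default backend uses R's
# built-in 'parallel' package).
squares <- dmapply(function(x) x^2, 1:8)

# collect() gathers the distributed result back to the driver.
result <- unlist(collect(squares))
print(result)

# Switching to a distributed backend (e.g. Distributed R) would be a
# one-line change, leaving the analysis code untouched, e.g.:
# useBackend(distributedR)
```

The point of the abstraction is that the analysis code above never mentions the execution engine; only the backend registration changes when moving from a laptop to a cluster.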


Duncan Temple Lang

University of California Davis


Edward Ma

Hewlett Packard Labs
