Tuesday, June 28 • 4:45pm - 5:03pm
How to keep your R code simple while tackling big datasets

Like many statistical analytics tools, R can be incredibly memory-intensive. A simple GAM (generalized additive model) or k-nearest neighbor routine can consume many times the memory occupied by the starting dataset, and R doesn't always behave nicely when it runs out of memory.

There are techniques for getting around memory limitations, such as partitioning tools or downsampling, but these require extra work. It would be much nicer to run elegantly simple R analytics without that hassle.

Using a very large public dataset from CMS.gov, Chuck will show GAM, GLM, decision tree, random forest, and k-nearest neighbor routines that were prototyped and run on a laptop, then run unchanged against the entire dataset on a single, simple Linux instance with over a terabyte of RAM. This big computer is actually a collection of smaller off-the-shelf servers that TidalScale combines into a single virtual server with several terabytes of RAM.

Gabriela de Queiroz

Sr. Developer Advocate/Manager, IBM
Gabriela de Queiroz is a Sr. Engineering & Data Science Manager and a Sr. Developer Advocate at IBM, where she leads the CODAIT Machine Learning Team. She works on various open source projects and is actively involved with several organizations to foster an inclusive community.

Chuck Piercey

KumoScale Product Management, Kioxia
B2B software product management & marketing. Writer: https://medium.com/@chuck1.piercey

Tuesday June 28, 2016 4:45pm - 5:03pm PDT
Barnes & McDowell & Cranston