Dynamic Documents with R Markdown (Part 1)

This is an intermediate/advanced level tutorial on dynamic documents with R Markdown. It starts with the basic idea of literate programming as well as its role in reproducible research. Among all document formats that **knitr** supports, we will only focus on R Markdown (.Rmd). We will give an overview of existing output formats in **rmarkdown**, and explain how to customize them. We will show how to build new output format functions by extending existing formats. The packages **tufte** and **bookdown** will be used as examples. We will mention other applications related to R Markdown such as HTML widgets [Vaidyanathan et al., 2015], Shiny documents [Chang et al., 2015], and how to run code from other languages (C, C++, and so on).

Monday June 27, 2016 9:00am - 10:15am PDT

SIEPR 130

Genome-wide association analysis and post-analytic interrogation with R (Part 1)

For complex traits, such as cardiometabolic disease, we increasingly recognize that the intergenic space between protein coding genes (PCGs) contains highly ordered regulatory elements that control expression and function of PCGs and in themselves can be actively transcribed molecules. Indeed, over 50% of genome-wide association studies (GWAS) of complex traits identify single nucleotide polymorphisms (SNPs) that fall in intergenic regions, and it is only recently becoming apparent that these regions are highly organized to perform specific functions. A next step in advancing precision medicine is careful and rigorous interrogation of the role of these regulatory elements, and their interplay with known PCGs and environmental factors, in the heritability of complex disease phenotypes. This tutorial focuses on analytic techniques and R tools designed to uncover these complex and largely uncharacterized relationships.

For details, refer to tutorial description.

Monday June 27, 2016 9:00am - 10:15am PDT

Wallenberg Hall 124

Handling and analyzing spatial, spatiotemporal and movement data (Part 1)

**Speakers**
## Edzer Pebesma

The tutorial will introduce users to the different types of spatial data (points, lines, polygons, rasters) and demonstrate how they are read in R. It will also explain how time series data can be imported, handled and analyzed in R. Then, it will explain the different types of spatiotemporal data and trajectory data, and present ways of importing them and analyzing them.

For details, refer to tutorial description.

professor, University of Muenster

My research interest is spatial, temporal, and spatiotemporal data in R. I am one of the authors, and maintainer, of sp and sf. You'll find my tutorial material at https://edzer.github.io/UseR2017/ - note that I will update it until shortly before the tutorial.

Monday June 27, 2016 9:00am - 10:15am PDT

SIEPR 120

Machine Learning Algorithmic Deep Dive (Part 1)

The goal of this tutorial is to provide participants with a deep understanding of four widely used algorithms in machine learning: Generalized Linear Model (GLM), Gradient Boosting Machine (GBM), Random Forest and Deep Neural Nets. This includes a deep dive into the algorithms in the abstract sense, and a review of the implementations of these algorithms available within the R ecosystem.

Due to their popularity, each of these algorithms has several implementations available in R. Each package author takes a unique approach to implementing the algorithm, and each package provides an overlapping, but not identical, set of model parameters available to the user. The tutorial will provide an in-depth analysis of how each of these algorithms was implemented in a handful of R packages.

After completing this tutorial, participants will have an understanding of how each of these algorithms works, and knowledge of the available R implementations and how they differ. The participants will understand, for example, why the xgboost package has, in less than a year, become one of the most popular GBM packages in R, even though the gbm R package has been around for years and has been widely used -- what are the implementation tricks used in xgboost that are not (yet) used in the gbm package? Or, why do some practitioners in certain domains prefer one implementation over another? We will answer these questions and more!

For details, refer to tutorial description.

Monday June 27, 2016 9:00am - 10:15am PDT

Campbell Rehearsal Hall

MoRe than woRds, Text and Context: Language Analytics in Finance with R (Part 1)

**Speakers**
## Sanjiv Das

## Karthik Mokashi

This tutorial surveys the technology and empirics of text analytics with a focus on finance applications. We present various tools of information extraction and basic text analytics. We survey a range of techniques of classification and predictive analytics, and metrics used to assess the performance of text analytics algorithms. We then review the literature on text mining and predictive analytics in finance, and its connection to networks, covering a wide range of text sources such as blogs, news, web posts, corporate filings, etc. We end with textual content presenting forecasts and predictions about future directions. The tutorial will use the R programming language throughout and present many hands-on examples.

For details, refer to tutorial description.

Terry Professor of Finance and Data Science, Santa Clara University

Text Analytics, FinTech, Network Risk.

Santa Clara University

I am pursuing my Master's at Santa Clara University with a focus on Data Science and Business Analytics. I am passionate about improving my knowledge of the different models and the accompanying tools, and how they are applied to diverse business questions. I am working on building...

Monday June 27, 2016 9:00am - 10:15am PDT

McDowell & Cranston

Never Tell Me the Odds! Machine Learning with Class Imbalances (Part 1)

This tutorial will provide an overview of using R to create effective predictive models in cases where at least one class has a low event frequency. These types of problems are often found in applications such as click-through rate prediction, disease prediction, chemical quantitative structure-activity modeling, network intrusion detection, and quantitative marketing. The session will step through the process of building, optimizing, testing, and comparing models that are focused on prediction. A case study is used to illustrate functionality.

For details, refer to tutorial description.

Monday June 27, 2016 9:00am - 10:15am PDT

Econ 140

Small Area Estimation with R (Part 1)

The tutorial will introduce different types of statistical methods for the analysis of survey data to produce estimates for small domains (sometimes termed ‘small areas’). This will include design-based estimators, which are based only on the study design and observed data, and model-based estimators, which rely on an underlying model to provide estimates. The tutorial will cover frequentist and Bayesian inference for Small Area Estimation. All methods will be accompanied by several examples that attendees will be able to reproduce.

This tutorial will be roughly based on the tutorial presented at useR! 2008 but will include updated materials. In particular, it will cover new R packages that have appeared since then.

For details, refer to tutorial description.

Monday June 27, 2016 9:00am - 10:15am PDT

Lane

Time-to-Event Modeling as the Foundation of Multi-Channel Revenue Attribution (Part 1)

In the realm of marketing analytics, time-to-event modeling at the customer level can provide a more granular view of the incremental impact that marketing campaigns have on individuals. Media that is addressable can be mapped to an individual, and even aggregated data can be mapped down to an individual via various techniques (e.g., geo, DMA). To accurately assess the incremental effect of marketing, a primary task during modeling is not only to estimate the magnitude/amplitude of the marketing effect, but also to capture the differing decay rates of each marketing effect.

This tutorial will describe the basics of applying time-to-event statistical modeling to marketing analytics problems. Beginning with data preparation, sampling, outlier detection and techniques to control for non-marketing effects, the tutorial will move on to consider various modeling strategies and methods for evaluating model effectiveness. The techniques and processes presented will mimic a typical marketing analytics workflow. We will be using a random sample from an anonymized large retail firm.

For details, refer to tutorial description.

Monday June 27, 2016 9:00am - 10:15am PDT

Barnes

Using Git and GitHub with R, RStudio, and R Markdown (Part 1)

Data analysts can use the Git version control system to manage a motley assortment of project files (e.g., data, code, reports) in a sane way. This has benefits for the solo analyst and, especially, for anyone who wants to communicate and collaborate with others. Git helps you organize your project over time and across different people and computers. Hosting services like GitHub, Bitbucket, and GitLab provide a home for your Git-based projects on the internet.

What's special about using R and Git(Hub)?

- the active R package development community on GitHub
- workflows for R scripts and R Markdown files that make it easy to share source and rendered results on GitHub
- Git- and GitHub-related features of the RStudio IDE

Monday June 27, 2016 9:00am - 10:15am PDT

Lyons & Lodato

Dynamic Documents with R Markdown (Part 2)

This is an intermediate/advanced level tutorial on dynamic documents with R Markdown. It starts with the basic idea of literate programming as well as its role in reproducible research. Among all document formats that **knitr** supports, we will only focus on R Markdown (.Rmd). We will give an overview of existing output formats in **rmarkdown**, and explain how to customize them. We will show how to build new output format functions by extending existing formats. The packages **tufte** and **bookdown** will be used as examples. We will mention other applications related to R Markdown such as HTML widgets [Vaidyanathan et al., 2015], Shiny documents [Chang et al., 2015], and how to run code from other languages (C, C++, and so on).

Monday June 27, 2016 10:30am - 12:00pm PDT

SIEPR 130

Genome-wide association analysis and post-analytic interrogation with R (Part 2)

For complex traits, such as cardiometabolic disease, we increasingly recognize that the intergenic space between protein coding genes (PCGs) contains highly ordered regulatory elements that control expression and function of PCGs and in themselves can be actively transcribed molecules. Indeed, over 50% of genome-wide association studies (GWAS) of complex traits identify single nucleotide polymorphisms (SNPs) that fall in intergenic regions, and it is only recently becoming apparent that these regions are highly organized to perform specific functions. A next step in advancing precision medicine is careful and rigorous interrogation of the role of these regulatory elements, and their interplay with known PCGs and environmental factors, in the heritability of complex disease phenotypes. This tutorial focuses on analytic techniques and R tools designed to uncover these complex and largely uncharacterized relationships.

For details, refer to tutorial description.

Monday June 27, 2016 10:30am - 12:00pm PDT

Wallenberg Hall 124

Handling and analyzing spatial, spatiotemporal and movement data (Part 2)

The tutorial will introduce users to the different types of spatial data (points, lines, polygons, rasters) and demonstrate how they are read in R. It will also explain how time series data can be imported, handled and analyzed in R. Then, it will explain the different types of spatiotemporal data and trajectory data, and present ways of importing them and analyzing them.

For details, refer to tutorial description.

**Speakers**
## Edzer Pebesma

professor, University of Muenster

My research interest is spatial, temporal, and spatiotemporal data in R. I am one of the authors, and maintainer, of sp and sf. You'll find my tutorial material at https://edzer.github.io/UseR2017/ - note that I will update it until shortly before the tutorial.

Monday June 27, 2016 10:30am - 12:00pm PDT

SIEPR 120

Machine Learning Algorithmic Deep Dive (Part 2)

The goal of this tutorial is to provide participants with a deep understanding of four widely used algorithms in machine learning: Generalized Linear Model (GLM), Gradient Boosting Machine (GBM), Random Forest and Deep Neural Nets. This includes a deep dive into the algorithms in the abstract sense, and a review of the implementations of these algorithms available within the R ecosystem.

Due to their popularity, each of these algorithms has several implementations available in R. Each package author takes a unique approach to implementing the algorithm, and each package provides an overlapping, but not identical, set of model parameters available to the user. The tutorial will provide an in-depth analysis of how each of these algorithms was implemented in a handful of R packages.

After completing this tutorial, participants will have an understanding of how each of these algorithms works, and knowledge of the available R implementations and how they differ. The participants will understand, for example, why the xgboost package has, in less than a year, become one of the most popular GBM packages in R, even though the gbm R package has been around for years and has been widely used -- what are the implementation tricks used in xgboost that are not (yet) used in the gbm package? Or, why do some practitioners in certain domains prefer one implementation over another? We will answer these questions and more!

For details, refer to tutorial description.

Monday June 27, 2016 10:30am - 12:00pm PDT

Campbell Rehearsal Hall

MoRe than woRds, Text and Context: Language Analytics in Finance with R (Part 2)

This tutorial surveys the technology and empirics of text analytics with a focus on finance applications. We present various tools of information extraction and basic text analytics. We survey a range of techniques of classification and predictive analytics, and metrics used to assess the performance of text analytics algorithms. We then review the literature on text mining and predictive analytics in finance, and its connection to networks, covering a wide range of text sources such as blogs, news, web posts, corporate filings, etc. We end with textual content presenting forecasts and predictions about future directions. The tutorial will use the R programming language throughout and present many hands-on examples.

For details, refer to tutorial description.

**Speakers**
## Sanjiv Das

## Karthik Mokashi

Terry Professor of Finance and Data Science, Santa Clara University

Text Analytics, FinTech, Network Risk.

Santa Clara University

I am pursuing my Master's at Santa Clara University with a focus on Data Science and Business Analytics. I am passionate about improving my knowledge of the different models and the accompanying tools, and how they are applied to diverse business questions. I am working on building...

Monday June 27, 2016 10:30am - 12:00pm PDT

McDowell & Cranston

Never Tell Me the Odds! Machine Learning with Class Imbalances (Part 2)

This tutorial will provide an overview of using R to create effective predictive models in cases where at least one class has a low event frequency. These types of problems are often found in applications such as click-through rate prediction, disease prediction, chemical quantitative structure-activity modeling, network intrusion detection, and quantitative marketing. The session will step through the process of building, optimizing, testing, and comparing models that are focused on prediction. A case study is used to illustrate functionality.

For details, refer to tutorial description.

Monday June 27, 2016 10:30am - 12:00pm PDT

Econ 140

Small Area Estimation with R (Part 2)

The tutorial will introduce different types of statistical methods for the analysis of survey data to produce estimates for small domains (sometimes termed ‘small areas’). This will include design-based estimators, which are based only on the study design and observed data, and model-based estimators, which rely on an underlying model to provide estimates. The tutorial will cover frequentist and Bayesian inference for Small Area Estimation. All methods will be accompanied by several examples that attendees will be able to reproduce.

This tutorial will be roughly based on the tutorial presented at useR! 2008 but will include updated materials. In particular, it will cover new R packages that have appeared since then.

For details, refer to tutorial description.

Monday June 27, 2016 10:30am - 12:00pm PDT

Lane

Time-to-Event Modeling as the Foundation of Multi-Channel Revenue Attribution (Part 2)

In the realm of marketing analytics, time-to-event modeling at the customer level can provide a more granular view of the incremental impact that marketing campaigns have on individuals. Media that is addressable can be mapped to an individual, and even aggregated data can be mapped down to an individual via various techniques (e.g., geo, DMA). To accurately assess the incremental effect of marketing, a primary task during modeling is not only to estimate the magnitude/amplitude of the marketing effect, but also to capture the differing decay rates of each marketing effect.

This tutorial will describe the basics of applying time-to-event statistical modeling to marketing analytics problems. Beginning with data preparation, sampling, outlier detection and techniques to control for non-marketing effects, the tutorial will move on to consider various modeling strategies and methods for evaluating model effectiveness. The techniques and processes presented will mimic a typical marketing analytics workflow. We will be using a random sample from an anonymized large retail firm.

For details, refer to tutorial description.

Monday June 27, 2016 10:30am - 12:00pm PDT

Barnes

Using Git and GitHub with R, RStudio, and R Markdown (Part 2)

Data analysts can use the Git version control system to manage a motley assortment of project files (e.g., data, code, reports) in a sane way. This has benefits for the solo analyst and, especially, for anyone who wants to communicate and collaborate with others. Git helps you organize your project over time and across different people and computers. Hosting services like GitHub, Bitbucket, and GitLab provide a home for your Git-based projects on the internet.

What's special about using R and Git(Hub)?

- the active R package development community on GitHub
- workflows for R scripts and R Markdown files that make it easy to share source and rendered results on GitHub
- Git- and GitHub-related features of the RStudio IDE

Monday June 27, 2016 10:30am - 12:00pm PDT

Lyons & Lodato

An Introduction to Bayesian Inference using R Interfaces to Stan (Part 1)

**Speakers**
## Ben Goodrich

The Stan project implements a probabilistic programming language, a library of mathematical and statistical functions, and a variety of algorithms to estimate statistical models in order to make Bayesian inferences from data. The three main sections of this tutorial will:

- Provide an introduction to modern Bayesian inference using Hamiltonian Markov Chain Monte Carlo (MCMC) as implemented in Stan.
- Teach the process of Bayesian inference using the rstanarm R package, which comes with all the necessary functions to support a handful of applied regression models that can be called by passing a formula and data.frame as the first two arguments (just like for glm).
- Demonstrate the power of the Stan language, which allows users to write a text file defining their own posterior distributions. The stan function in the rstan R package parses this file into C++, which is then compiled and executed in order to sample from the posterior distribution via MCMC.

Lecturer in the Discipline of Political Science, Columbia University

Ben Goodrich is a core developer of Stan, which is a collection of statistical software for Bayesian estimation of models, and is the maintainer of the corresponding rstan and rstanarm R packages. He teaches in the political science department and in the Quantitative Methods in the...

Monday June 27, 2016 1:00pm - 2:15pm PDT

SIEPR 130

Effective Shiny Programming (Part 1)

Shiny is a package for R that makes it easy to build interactive applications that combine the friendliness of a web page with the power of R. The goal of this tutorial is to help Shiny app authors to improve their Shiny skills, so that they can build apps that are easier to write, debug, and enhance.

For details, refer to tutorial description.

Monday June 27, 2016 1:00pm - 2:15pm PDT

McCaw Hall

Extracting data from the web APIs and beyond (Part 1)

**Instructors: Karthik Ram, Garrett Grolemund and Scott Chamberlain**

No matter what your domain of interest or expertise, the internet is a treasure trove of useful data that comes in many shapes, forms, and sizes, from beautifully documented fast APIs to data that need to be scraped from deep inside of 1990s HTML pages. In this 3-hour tutorial you will learn how to programmatically read in various types of web data from experts in the field (founders of the rOpenSci project and the training lead of RStudio). By the end of the tutorial you will have a basic idea of how to wrap an R package around a standard API, extract common non-standard data formats, and scrape data into tidy data frames from web pages.

**Background Knowledge** Familiarity with base R and ability to write functions.

**Requirements**

R with the latest versions of httr, rvest, and curl. It would also be helpful to have a recent release of R and RStudio.

**Target Audience**

Any R user with an interest in retrieving data from the web.

Website for materials: All material for the tutorial will be posted at: http://ropensci.github.io/user2016-tutorial/ (including instructions on packages that you'll need to install ahead of time).

More information and code available on our GitHub repository

**Speakers**
## Karthik Ram

co-founder, rOpenSci

Karthik Ram is a co-founder of rOpenSci, and a data science fellow at the University of California's Berkeley Institute for Data Science. Karthik primarily works on a project that develops R-based tools to facilitate open science and access to open data.

Monday June 27, 2016 1:00pm - 2:15pm PDT

Campbell Rehearsal Hall

Introduction to SparkR (Part 1)

Apache Spark is a popular cluster computing framework used for performing large scale data analysis. This tutorial will introduce cluster computing using SparkR: the R language API for Spark. SparkR provides a distributed data frame API that enables structured data processing with a syntax familiar to R users. In this tutorial we will provide example workflows for ingesting data, performing data analysis and doing interactive queries using distributed data frames. Finally, participants will be able to try SparkR on real-world datasets using Databricks R notebooks to get hands-on experience using SparkR.

For details, refer to tutorial description.

Monday June 27, 2016 1:00pm - 2:15pm PDT

Econ 140

Missing Value Imputation with R (Part 1)

The ability to easily collect and gather a large amount of data from different sources can be seen as an opportunity to better understand many processes. It has already led to breakthroughs in several application areas. However, due to the wide heterogeneity of measurements and objectives, these large databases often exhibit an extraordinarily high number of missing values. Hence, in addition to scientific questions, such data also present some important methodological and technical challenges for data analysts.

The aim of this tutorial is to present an overview of the missing values literature as well as the recent improvements that caught the attention of the community due to their ability to handle large matrices with a large number of missing entries.

We will touch upon the topics of single imputation with a focus on matrix completion methods based on iterative regularized SVD, notions of confidence intervals by giving the fundamentals of multiple imputation strategies, as well as issues of visualization with incomplete data. The approaches will be illustrated with real data with continuous, binary and categorical variables using some of the main R packages: Amelia, mice, missForest, missMDA, norm, softImpute, VIM.

For details, refer to tutorial description.

Monday June 27, 2016 1:00pm - 2:15pm PDT

Barnes

Ninja moves with data.table - learn by doing in a cookbook style workshop (Part 1)

data.table is known for its speed on large data in RAM (e.g. 100GB), but it also has a consistent and flexible syntax for more advanced data manipulation tasks on small data. First released to CRAN in 2006, it continues to grow in popularity. 180 CRAN and Bioconductor packages now import or depend on data.table. Its StackOverflow tag has attracted 4,000 questions from users in many fields, making it one of the three most asked-about R packages. It is the 7th most starred R package on GitHub.

This three-hour tutorial will guide complete beginners from basic queries through to advanced topics via examples you will run on your laptop. There is a short learning curve to data.table, but once it clicks, it sticks.

For details, refer to tutorial description.

Monday June 27, 2016 1:00pm - 2:15pm PDT

Lane

Regression Modeling Strategies and the rms package (Part 1)

**Speakers**
## Frank Harrell

The art of data analysis concerns using flexible statistical models, choosing tools wisely, avoiding overfitting, estimating quantities of interest, making statistical inferences and predictions, validating predictive accuracy, graphical presentation of complex models, and many other important techniques. Regression models can be extended in a number of ways to meet many of the modern challenges in data analysis. Software that makes it easier to incorporate modern statistical methods and good statistical practice removes obstacles and leads to greater insights from data. The presenter has striven to bring modern regression, missing data imputation, data reduction, and bootstrap model validation techniques into everyday practice by writing *Regression Modeling Strategies* (Springer, 2015, 2nd edition) and by writing an R package rms that accompanies the book. Detailed information may be found at http://biostat.mc.vanderbilt.edu/rms.

The tutorial will cover two chapters in *Regression Modeling Strategies* related to general aspects of multivariable regression, relaxing linearity assumptions using restricted cubic splines, multivariable modeling strategy, and a brief introduction to bootstrap model validation. The rms package will be introduced, and at least two detailed case studies using the package will be presented. The methods covered will apply to almost any regression model, including ordinary least squares, logistic regression models, ordinal regression, quantile regression, longitudinal data analysis, and survival models.

Professor, Vanderbilt University School of Medicine

regression modeling; user of the S language for 25 years

Monday June 27, 2016 1:00pm - 2:15pm PDT

Lyons & Lodato

Understanding and creating interactive graphics (Part 1)

An interactive graphic invites the viewer to become an active partner in the analysis and allows for immediate feedback on how the data and results may change when inputs are modified. Interactive graphics can be extremely useful for exploratory data analysis, for teaching, and for reporting.

Because there are so many different kinds of interactive graphics, there has been an explosion in R packages that can produce them (e.g. animint, shiny, rCharts, rMaps, ggvis, htmlwidgets). A beginner with little knowledge of interactive graphics can thus easily struggle with (1) understanding what kinds of graphics are useful for what kinds of data, and (2) finding an R package that can produce the desired type of graphic. This tutorial solves these two problems by (1) introducing a vocabulary of keywords for understanding the different kinds of graphics, and (2) explaining what R packages can be used for each kind of graphic.

Attendees will gain hands-on experience with using R to create interactive graphics. We will discuss several example data sets and several R interactive graphics packages. Attendees will learn a vocabulary that helps to understand the strengths and weaknesses of the many different packages which are currently available.

For details, refer to tutorial description.

Monday June 27, 2016 1:00pm - 2:15pm PDT

McDowell & Cranston

Using R with Jupyter notebooks for reproducible research (Part 1)

**Speakers**
## Andrie De Vries

This tutorial introduces the Jupyter notebook project (previously called IPython notebooks). The tutorial will describe how Jupyter can be used for:

- **interactive** coding and mark-up
- **sandbox**, sharing test code, notes (as markdown and comments) and results with colleagues, publishing snippets and useful findings
- **literate programming**, similar to Sweave and knitr but with added interactivity
- Multi-language interoperability, e.g. notebooks that include R as well as Python code
- **developing training**, using Jupyter notebooks locally or in the cloud
- **delivery of training** with persistent content, using a multi-tenant Jupyter notebook system

Programme Manager, Microsoft

I am a Programme Manager with Microsoft Data Science, based in London. During the last year, I've worked on Microsoft's community projects, including Microsoft R Open and MRAN. I am the maintainer for several packages that help reproducible research, specifically checkpoint and miniCRAN...

Monday June 27, 2016 1:00pm - 2:15pm PDT

SIEPR 120

An Introduction to Bayesian Inference using R Interfaces to Stan (Part 2)

**Speakers**
## Ben Goodrich

The Stan project implements a probabilistic programming language, a library of mathematical and statistical functions, and a variety of algorithms to estimate statistical models in order to make Bayesian inferences from data. The three main sections of this tutorial will:

- Provide an introduction to modern Bayesian inference using Hamiltonian Markov Chain Monte Carlo (MCMC) as implemented in Stan.
- Teach the process of Bayesian inference using the rstanarm R package, which comes with all the necessary functions to support a handful of applied regression models that can be called by passing a formula and data.frame as the first two arguments (just like for glm).
- Demonstrate the power of the Stan language, which allows users to write a text file defining their own posterior distributions. The stan function in the rstan R package parses this file into C++, which is then compiled and executed in order to sample from the posterior distribution via MCMC.

Lecturer in the Discipline of Political Science, Columbia University

Ben Goodrich is a core developer of Stan, which is a collection of statistical software for Bayesian estimation of models, and is the maintainer of the corresponding rstan and rstanarm R packages. He teaches in the political science department and in the Quantitative Methods in the...

Monday June 27, 2016 2:30pm - 4:00pm PDT

SIEPR 130

Effective Shiny Programming (Part 2)
Shiny is a package for R that makes it easy to build interactive applications that combine the friendliness of a web page with the power of R. The goal of this tutorial is to help Shiny app authors to improve their Shiny skills, so that they can build apps that are easier to write, debug, and enhance.

For details, refer to tutorial description.

Monday June 27, 2016 2:30pm - 4:00pm PDT

McCaw Hall

Extracting data from the web APIs and beyond (Part 2)

**Karthik Ram - UC Berkeley; Garrett Grolemund - RStudio, Inc.; Scott Chamberlain - rOpenSci**

**Tutorial Description**

No matter what your domain of interest or expertise, the internet is a treasure trove of useful data that comes in many shapes, forms, and sizes, from beautifully documented fast APIs to data that need to be scraped from deep inside 1990s HTML pages. In this 3-hour tutorial you will learn how to programmatically read in various types of web data from experts in the field (founders of the rOpenSci project and the training lead of RStudio). By the end of the tutorial you will have a basic idea of how to wrap an R package around a standard API, extract common non-standard data formats, and scrape data into tidy data frames from web pages.

**Background Knowledge**

Familiarity with base R and ability to write functions.

**Requirements**

R with latest versions of httr, rvest, and curl. It would also be helpful to have a recent release of R and RStudio.

**Website for Materials**

All material for the tutorial will be posted at: http://ropensci.github.io/user2016-tutorial/ (including instructions on packages that you'll need to install ahead of time).

**Speakers**
## Karthik Ram

co-founder, rOpenSci

Karthik Ram is a co-founder of rOpenSci, and a data science fellow at the University of California's Berkeley Institute for Data Science. Karthik primarily works on a project that develops R-based tools to facilitate open science and access to open data.

Monday June 27, 2016 2:30pm - 4:00pm PDT

Campbell Rehearsal Hall

Introduction to SparkR (Part 2)
Apache Spark is a popular cluster computing framework used for performing large scale data analysis. This tutorial will introduce cluster computing using SparkR: the R language API for Spark. SparkR provides a distributed data frame API that enables structured data processing with a syntax familiar to R users. In this tutorial we will provide example workflows for ingesting data, performing data analysis and doing interactive queries using distributed data frames. Finally, participants will be able to try SparkR on real-world datasets using Databricks R notebooks to get hands-on experience using SparkR.

For details, refer to tutorial description.

Monday June 27, 2016 2:30pm - 4:00pm PDT

Econ 140

Missing Value Imputation with R (Part 2)
The ability to easily collect and gather a large amount of data from different sources can be seen as an opportunity to better understand many processes. It has already led to breakthroughs in several application areas. However, due to the wide heterogeneity of measurements and objectives, these large databases often exhibit an extraordinarily high number of missing values. Hence, in addition to scientific questions, such data also present some important methodological and technical challenges for data analysts.

The aim of this tutorial is to present an overview of the missing values literature as well as the recent improvements that caught the attention of the community due to their ability to handle large matrices with a large number of missing entries.

We will touch upon the topics of single imputation with a focus on matrix completion methods based on iterative regularized SVD, notions of confidence intervals by giving the fundamentals of multiple imputation strategies, as well as issues of visualization with incomplete data. The approaches will be illustrated with real data with continuous, binary and categorical variables using some of the main R packages: Amelia, mice, missForest, missMDA, norm, softImpute, VIM.

For details, refer to tutorial description.

Monday June 27, 2016 2:30pm - 4:00pm PDT

Barnes

Ninja moves with data.table - learn by doing in a cookbook style workshop (Part 2)

data.table is known for its speed on large data in RAM (e.g. 100GB), but it also has a consistent and flexible syntax for more advanced data manipulation tasks on small data. First released to CRAN in 2006, it continues to grow in popularity. 180 CRAN and Bioconductor packages now import or depend on data.table. Its StackOverflow tag has attracted 4,000 questions from users in many fields, making it one of the three most asked-about R packages. It is the 7th most starred R package on GitHub.

This three-hour tutorial will guide complete beginners from basic queries through to advanced topics via examples you will run on your laptop. There is a short learning curve to data.table, but once it clicks it sticks.

For details, refer to tutorial description.
Monday June 27, 2016 2:30pm - 4:00pm PDT

Lane

Regression Modeling Strategies and the rms package (Part 2)

**Speakers**
## Frank Harrell

The art of data analysis concerns using flexible statistical models, choosing tools wisely, avoiding overfitting, estimating quantities of interest, making statistical inferences and predictions, validating predictive accuracy, graphical presentation of complex models, and many other important techniques. Regression models can be extended in a number of ways to meet many of the modern challenges in data analysis. Software that makes it easier to incorporate modern statistical methods and good statistical practice removes obstacles and leads to greater insights from data. The presenter has striven to bring modern regression, missing data imputation, data reduction, and bootstrap model validation techniques into everyday practice by writing *Regression Modeling Strategies* (Springer, 2015, 2nd edition) and by writing an R package rms that accompanies the book. Detailed information may be found at http://biostat.mc.vanderbilt.edu/rms.

The tutorial will cover two chapters in *Regression Modeling Strategies* related to general aspects of multivariable regression, relaxing linearity assumptions using restricted cubic splines, multivariable modeling strategy, and a brief introduction to bootstrap model validation. The rms package will be introduced, and at least two detailed case studies using the package will be presented. The methods covered will apply to almost any regression model, including ordinary least squares, logistic regression models, ordinal regression, quantile regression, longitudinal data analysis, and survival models.

Professor, Vanderbilt University School of Medicine

regression modeling; user of the S language for 25 years

Monday June 27, 2016 2:30pm - 4:00pm PDT

Lyons & Lodato

Understanding and creating interactive graphics (Part 2)

An interactive graphic invites the viewer to become an active partner in the analysis and allows for immediate feedback on how the data and results may change when inputs are modified. Interactive graphics can be extremely useful for exploratory data analysis, for teaching, and for reporting.

Because there are so many different kinds of interactive graphics, there has been an explosion in R packages that can produce them (e.g. animint, shiny, rCharts, rMaps, ggvis, htmlwidgets). A beginner with little knowledge of interactive graphics can thus be easily confused by (1) understanding what kinds of graphics are useful for what kinds of data, and (2) finding an R package that can produce the desired type of graphic. This tutorial solves these two problems by (1) introducing a vocabulary of keywords for understanding the different kinds of graphics, and (2) explaining what R packages can be used for each kind of graphic.

Attendees will gain hands-on experience with using R to create interactive graphics. We will discuss several example data sets and several R interactive graphics packages. Attendees will learn a vocabulary that helps to understand the strengths and weaknesses of the many different packages which are currently available.

For details, refer to tutorial description.
Monday June 27, 2016 2:30pm - 4:00pm PDT

McDowell & Cranston

Using R with Jupyter notebooks for reproducible research (Part 2)

**Speakers**
## Andrie De Vries

This tutorial introduces the Jupyter notebook project (previously called IPython notebooks). The tutorial will describe how Jupyter can be used for:

- **interactive** coding and mark-up
- **sandbox**, sharing test code, notes (as markdown and comments) and results with colleagues, publishing snippets and useful findings
- **literate programming**, similar to Sweave and knitr but with added interactivity
- Multi-language interoperability, e.g. notebooks that include R as well as Python code
- **developing training**, using Jupyter notebooks locally or in the cloud
- **delivery of training** with persistent content, using a multi-tenant Jupyter notebook system

Programme Manager, Microsoft

I am a Programme Manager with Microsoft Data Science, based in London. During the last year, I've worked on Microsoft's community projects, including Microsoft R Open and MRAN. I am the maintainer for several packages that help reproducible research, specifically checkpoint and miniCRAN...

Monday June 27, 2016 2:30pm - 4:00pm PDT

SIEPR 120

What's up with the R consortium?

**Moderators**

**Speakers**
## Joseph Rickert

The R Consortium is a business association organized under the Linux Foundation with a mission to support the R Community. Founded just before useR! 2015, it has already become a focus for R Community activities. During its first year, the R Consortium has begun to evaluate and fund projects while dealing with all of the start-up issues of developing internal structures, policies and operating procedures. In this talk, I will attempt to provide some insight into the workings of the R Consortium, describe the process behind the recent call for proposals, discuss the projects selected for funding so far, and provide some guidance on writing a proposal for the next round of funding, which will close on July 10th.

Program Manager, Microsoft

Joseph is a Program Manager at Microsoft, having come to Microsoft with the acquisition of Revolution Analytics. He is a data scientist and R language evangelist passionate about analyzing data and teaching people about R. He is a regular contributor to the Revolutions blog and an...

Monday June 27, 2016 4:15pm - 4:40pm PDT

McCaw Hall

Presentation of the Women in R task force

**Moderators**

**Speakers**
## Heather Turner

Presentations and discussion regarding work of the Women in R task force.

Freelance consultant

I'm a freelance statistical computing consultant providing support in R to people in a range of industries, but particularly the life sciences. My interests include statistical modelling, clustering, bioinformatics, reproducible research and graphics. I chair the core group of Forwards...

Monday June 27, 2016 4:40pm - 5:00pm PDT

McCaw Hall

R-Ladies Presentation 1

**Moderators**

**Speakers**
## Gabriela de Queiroz

Presentations and discussion regarding work of the R-Ladies from San Francisco.

Sr. Developer Advocate/Manager, IBM

Gabriela de Queiroz is a Sr. Engineering & Data Science Manager and a Sr. Developer Advocate at IBM where she leads the CODAIT Machine Learning Team. She works in different open source projects and is actively involved with several organizations to foster an inclusive community. She...

Monday June 27, 2016 5:00pm - 5:10pm PDT

McCaw Hall

R-Ladies Presentation 2

**Moderators**

**Speakers**

Presentations and discussion regarding work of the R-Ladies from San Francisco and London.

Monday June 27, 2016 5:10pm - 5:20pm PDT

McCaw Hall

Discussion

**Moderators**

Presentations and discussion regarding work of the R consortium, Women in R task force, and R-Ladies from San Francisco and London.

Beverages and light refreshments will be provided after this session.

Monday June 27, 2016 5:20pm - 5:30pm PDT

McCaw Hall

Forty years of S

Bell Labs in the 1970s was a hotbed of research in computing, statistics and many other fields. The conditions there encouraged the growth of the S language and influenced its content. The 40th anniversary of S is an appropriate time to relate a personal view of that scene and reflect on why S (and R) turned out as it did.

Tuesday June 28, 2016 9:00am - 10:00am PDT

McCaw Hall

bamdit: An R Package for Bayesian meta-analysis of diagnostic test data

**Moderators**
## Ben Goodrich

**Speakers**
## Pablo Emilio Verde

In this work we present the R package bamdit, whose name stands for "Bayesian meta-analysis of diagnostic test-data". bamdit was developed with the aim of simplifying the use of models in meta-analysis that up to now have demanded great statistical expertise in Bayesian meta-analysis. The package implements a series of innovative statistical techniques including: the Bayesian Summary Receiver Operating Characteristic curve, the use of prior distributions that avoid boundary estimation problems for variance components and correlation parameters, analysis of conflict of evidence, and robust estimation of model parameters. In addition, the package comes with several published examples of meta-analysis that can be used for illustration or further research in this area.

Lecturer in the Discipline of Political Science, Columbia University

Ben Goodrich is a core developer of Stan, which is a collection of statistical software for Bayesian estimation of models, and is the maintainer of the corresponding rstan and rstanarm R packages. He teaches in the political science department and in the Quantitative Methods in the...

Senior Researcher, University of Düsseldorf

I have been an R user since June 1998. I translated the R messages to Spanish and I have been maintaining this translation for several years. My main interest in R is in the use of Bayesian data analysis and Bayesian meta-analysis in clinical research.

Tuesday June 28, 2016 10:30am - 10:48am PDT

Barnes & McDowell & Cranston

edeaR: Extracting knowledge from process data

**Moderators**
## Emilio L. Cano

**Speakers**

During the last decades, the logging of events in a business context has increased massively. Information concerning activities within a broad range of business processes is recorded in so-called event logs. Connecting the domains of business process management and data mining, process mining aims at extracting process-related knowledge from these event logs, in order to gain competitive advantages. Over the last years, many tools for process mining analyses have been developed, having both commercial and academic origins. Nevertheless, most of them leave little room for extensions or interactive use. Moreover, they are not able to use existing data manipulation and visualization tools. In order to meet these shortcomings, the R-package edeaR was developed to enable the creation and analysis of event logs in R. It provides functionality to read and write logs from .XES-files, the eXtensible Event Stream format, which is the generally-acknowledged format for the interchange of event log data. By using the extensive spectrum of data manipulation methods in R, edeaR provides a very convenient way to build .XES-files from raw data, which is a cumbersome task in most existing process mining tools. Furthermore, the package contains a wide set of functions to describe and select event data, thereby facilitating exploratory and descriptive analysis. Being able to handle event data in R both empowers process miners to exploit the vast area of data analysis methods in R, and invites R-users to contribute to this rapidly emerging and promising field of process mining.

Researcher and Lecturer, Rey Juan Carlos University and the University of Castilla-La Mancha

Statistician, R enthusiast. Research topics: Statistical Process Control, Six Sigma methodology, Stochastic Optimization, energy market modelling.

Tuesday June 28, 2016 10:30am - 10:48am PDT

Econ 140

R in machine learning competitions

**Moderators**
## Heather Turner

**Speakers**
## Anthony Goldbloom

Kaggle is a community of almost 450K data scientists who have built almost 2MM machine learning models to participate in our competitions. Data scientists come to Kaggle to learn, collaborate and develop the state of the art in machine learning. This talk will cover some of the lessons from winning techniques, with a particular emphasis on best practice R use.

Freelance consultant

I'm a freelance statistical computing consultant providing support in R to people in a range of industries, but particularly the life sciences. My interests include statistical modelling, clustering, bioinformatics, reproducible research and graphics. I chair the core group of Forwards...

CEO, Kaggle

Anthony is the founder and CEO of Kaggle. Before founding Kaggle, Anthony worked in the macroeconomic modeling areas of the Reserve Bank of Australia and before that the Australian Treasury. He holds a first class honours degree in economics and econometrics from the University of...

Tuesday June 28, 2016 10:30am - 10:48am PDT

McCaw Hall

Using Spark with Shiny and R Markdown

R is well-suited to handle data that can fit in memory, but additional tools are needed when the amount of data you want to analyze in R grows beyond the limits of your machine’s RAM. A variety of solutions to this problem have been proposed for R over the years; one of the latest options is Apache Spark™. Spark is a cluster computing tool that enables analysis of massive, distributed data across dozens or hundreds of servers. Spark now includes an integration with R via the SparkR package. Due to Spark’s ability to interact with distributed data with little latency, it is becoming an attractive tool for interfacing with large datasets in an interactive environment. In addition to handling the storage of data, Spark also incorporates a variety of other tools including stream processing, computing on graphs, and a distributed machine learning framework. Some of these tools are available to R programmers via the SparkR package. In this talk, we’ll discuss how to leverage Spark’s capabilities in a modern R environment. In particular, we’ll discuss how to use Spark within an R Markdown document or even in an interactive Shiny application. We’ll also briefly discuss alternative approaches to working with large data in R and the pros and cons of using Spark.

Tuesday June 28, 2016 10:30am - 10:48am PDT

SIEPR 130

Wrapping Your R tools to Analyze National-Scale Cancer Genomics in the Cloud

**Moderators**
## Benoit Liquet

**Speakers**

The Cancer Genomics Cloud (CGC), built by Seven Bridges and funded by the National Cancer Institute, hosts The Cancer Genome Atlas (TCGA), which is one of the world's largest cancer genomics data collections. Computational resources and optimized, portable bioinformatics tools are provided to analyze the cancer data at any scale immediately, collaboratively, and reproducibly. The Seven Bridges platform is available not only on AWS but also on Google Cloud. With Docker and the Common Workflow Language (CWL) open standard, wrapping a tool written in any programming language for the cloud and computing on petabytes of data has never been so easy. The open source R/Bioconductor package ‘sevenbridges’ was developed to provide full API support for the Seven Bridges platforms, including CGC. It supports flexible operations on projects, tasks, files, billing, apps and so on, so users can easily develop fully automatic workflows within R to do an end-to-end data analysis in the cloud, from raw data to report. Most importantly, the ‘sevenbridges’ package also provides an interface to describe your tools in R and export them to CWL format in JSON and YAML, so that you can share them easily with collaborators and execute them in different environments, locally or in the cloud, with everything fully reproducible. Combined with the R API client functionality, users will be able to create a CWL tool in R and execute it in the Cancer Genomics Cloud to analyze the huge amount of cancer data at scale.

University Pau et Pays de L'Adour, ACEMS: Centre of Excellence for Mathematical and Statistical Frontiers, QUT, Australia

Tuesday June 28, 2016 10:30am - 10:48am PDT

Lane & Lyons & Lodato

Compiling parts of R using the NIMBLE system for programming algorithms

**Moderators**
## Ben Goodrich

Ben Goodrich is a core developer of Stan, which is a collection of statistical software for Bayesian estimation of models, and is the maintainer of the corresponding rstan and rstanarm R packages. He teaches in the political science department and in the Quantitative Methods in the...

**Speakers**

The NIMBLE R package provides a flexible system for programming statistical algorithms for hierarchical models specified using the BUGS language. As part of the system, we compile R code for algorithms and seamlessly link the compiled objects back into R, with our focus being on mathematical operations. Our compiler first generates C++, including Eigen code for linear algebra, before the usual compilation process. The NIMBLE compiler was written with extensibility in mind, such that adding new operations for compilation requires only a few well-defined additions to the code base. We'll describe how one can easily write functions in R and automatically compile them, as well as how the compiler operates behind the scenes. Functions can be stand-alone functions or can be functions that interact with hierarchical models written in BUGS code, which NIMBLE converts to a set of functions and data structures that are also compiled via C++. Finally, we'll show how the system has been used to build a full suite of MCMC and sequential Monte Carlo algorithms that can be used on any hierarchical model.

Lecturer in the Discipline of Political Science, Columbia University

Tuesday June 28, 2016 10:48am - 11:06am PDT

Barnes & McDowell & Cranston

Connecting R to the OpenML project for Open Machine Learning

**Moderators**
## Heather Turner

**Speakers**
## Joaquin Vanschoren

OpenML is an online machine learning platform where researchers can automatically log and share data, code, and experiments, and organize them online to work and collaborate more effectively. We present an R package to interface with the OpenML platform and illustrate its usage both as a stand-alone package and in combination with the mlr machine learning package. We show how the OpenML package allows R users to easily search, download and upload machine learning datasets. Users can easily log their auto ML experiment results online, have them evaluated on the server, share them with others and download results from other researchers to build on them. Beyond ensuring reproducibility of results, it automates much of the drudge work, speeds up research, facilitates collaboration and increases users' visibility online. Currently, OpenML has 1,000+ registered users, 2,000+ unique monthly visitors, 2,000+ datasets, and 500,000+ experiments. The OpenML server currently supports client interfaces for Java, Python, .NET and R as well as specific interfaces for the WEKA, MOA, RapidMiner, scikit-learn and mlr toolboxes for machine learning.

Freelance consultant

I'm a freelance statistical computing consultant providing support in R to people in a range of industries, but particularly the life sciences. My interests include statistical modelling, clustering, bioinformatics, reproducible research and graphics. I chair the core group of Forwards...

Assistant Professor, Eindhoven University of Technology

My research focuses on the automation and democratization of machine learning. I founded OpenML.org, a collaborative machine learning platform where scientists can automatically log and share data, code, and experiments, and which automatically learns from all this data to help people...

Tuesday June 28, 2016 10:48am - 11:06am PDT

McCaw Hall

Implementing R in old economy companies: From proof-of-concept to production

**Moderators**
## Emilio L. Cano

**Speakers**
## Oliver Bracht

In old economy companies, the introduction of R is typically a bottom-up process that follows a pattern of three major stages of maturity. At the first stage, guerrilla projects use R in parallel to the "official" IT environment. The usage of R is often initiated by interns, student assistants or newly recruited graduates. At the second stage, when the results of the guerrilla projects attract the attention of business departments, R is used as the analytic language in proof-of-concept projects. When the proof-of-concept has been successful, the outcome is to be transferred to the production system. At this third stage, R is introduced “officially” to the IT environment. While the first and second levels of maturity usually do not cause any major problems, the step to the third level is most crucial for the long-term success of the implementation of R. This talk will focus on how to master the switch from proof-of-concept to production. Based on real-world experience, it will show typical roadblocks as well as the most important success factors.

Researcher and Lecturer, Rey Juan Carlos University and the University of Castilla-La Mancha

Statistician, R enthusiast. Research topics: Statistical Process Control, Six Sigma methodology, Stochastic Optimization, energy market modelling.

Chief Data Scientist, eoda GmbH

At the useR Conference I am excited to meet the community and to see what's new. I am especially interested in business cases and the professional usage of R within companies.

Tuesday June 28, 2016 10:48am - 11:06am PDT

Econ 140

RcppParallel: A Toolkit for Portable, High-Performance Algorithms

**Moderators**

**Speakers**
## Kevin Ushey

Modern computers and processors provide many advanced facilities for the concurrent, or parallel, execution of code. While R is fundamentally a single-threaded program, it can call into multi-threaded code, provided that such code interacts with R in a thread-safe manner. However, writing concurrent programs that run both safely and correctly is a very difficult task, and requires substantial expertise when working with the primitives provided by most programming languages or libraries. RcppParallel provides a complete toolkit for creating safe, portable, high-performance parallel algorithms, built on top of the Intel "Threading Building Blocks" (TBB) and "TinyThread" libraries. In particular, RcppParallel provides two high-level operations, 'parallelFor' and 'parallelReduce', which provide a framework for the safe, performant implementation of many kinds of parallel algorithms. We'll showcase how RcppParallel might be used to implement a parallel algorithm, and how the generated routine could be used in an R package.
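
For readers unfamiliar with the pattern, the split-and-join idea behind a parallel reduce can be sketched in a few lines. This is a Python illustration of the general technique only, not RcppParallel's C++ API; the function name `parallel_reduce` and its parameters are hypothetical, and Python threads are used purely to show the structure (they would not speed up pure-Python arithmetic).

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator


def parallel_reduce(values, worker, join, identity, n_chunks=4):
    """Reduce each chunk independently, then join the partial results.

    Hypothetical helper for illustration; `worker` folds items within a
    chunk and `join` combines the per-chunk partial results.
    """
    if not values:
        return identity
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        # Each chunk is reduced without touching shared state.
        partials = list(pool.map(lambda chunk: reduce(worker, chunk, identity), chunks))
    # The join step runs once, on the small list of partial results.
    return reduce(join, partials, identity)


total = parallel_reduce(list(range(1, 101)), operator.add, operator.add, 0)
print(total)  # 5050
```

Because each chunk only reads its own slice and writes its own partial result, no locking is needed; the combining operation must be associative for the result to match a sequential reduce.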

Software Engineer, RStudio

Software engineer on a team using C++, Java, JavaScript, and R to build an IDE for the R programming language. Say hi if you want to talk about how weird / awesome R is :)

Tuesday June 28, 2016 10:48am - 11:06am PDT

SIEPR 130

Two-sample testing in high dimensions

**Moderators**
## Benoit Liquet

**Speakers**

Estimation for high-dimensional models has been widely studied. However, uncertainty quantification remains challenging. We put forward novel methodology for two-sample testing in high dimensions (Städler and Mukherjee, JRSSB, 2016). The key idea is to exploit sparse structure in the construction of the test statistics and in p-value calculation. This renders the test effective but leads to challenging technical issues that we solve via novel theory that extends the likelihood ratio test to the high-dimensional setting. For computation we use randomized data-splitting: sparsity structure is estimated using the first half of the data, and p-value calculation is carried out using the second half. P-values from multiple splits are aggregated to give a final result. Our test is very general and applicable to any model class where sparse estimation is possible. We call the application to graphical models Differential Network. Our method is implemented in the recently released Bioconductor package nethet. Besides code for high-dimensional testing the package provides other tools for exploring heterogeneity from high-dimensional data. For example, we make a novel network-based clustering algorithm available and provide several visualization functionalities. Molecular networks play a central role in biology. An emerging notion is that networks themselves are thought to differ between biological contexts, such as cell type, tissue type, or disease state. As an example we consider protein data from The Cancer Genome Atlas. Differential Network applied to this data set provides evidence over thousands of patient samples in support of the notion that cancers differ at the protein network level.

University Pau et Pays de L'Adour, ACEMS: Centre of Excellence for Mathematical and Statistical Frontiers, QUT, Australia

Tuesday June 28, 2016 10:48am - 11:06am PDT

Lane & Lyons & Lodato

Bayesian analysis of generalized linear mixed models with JAGS

**Moderators**
## Ben Goodrich

Ben Goodrich is a core developer of Stan, which is a collection of statistical software for Bayesian estimation of models, and is the maintainer of the corresponding rstan and rstanarm R packages. He teaches in the political science department and in the Quantitative Methods in the...

**Speakers**
## Martyn Plummer

BUGS is a language for describing hierarchical Bayesian models which syntactically resembles R. BUGS allows large complex models to be built from smaller components. JAGS is a BUGS interpreter written in C++ which enables Bayesian inference using Markov Chain Monte Carlo (MCMC). Several R packages provide interfaces to JAGS (e.g. rjags, runjags, R2jags, bayesmix, iBUGS, jagsUI, HydeNet). The efficiency of MCMC depends heavily on the sampling methods used. Therefore a key function of the JAGS interpreter is to identify design motifs in a large complex Bayesian model that have well-characterized MCMC solutions and apply the appropriate sampling methods. Generalized linear models (GLMs) form a recurring design motif in many hierarchical Bayesian models. Several data augmentation schemes have been proposed that reduce a GLM to a linear model and allow efficient sampling of the coefficients. These schemes are implemented in the glm module of JAGS. The glm module also includes an interface to the sparse matrix algebra library CHOLMOD, allowing the analysis of GLMs with sparse design matrices. The use of sparse matrices in a Bayesian GLM renders the distinction between "fixed" and "random" effects irrelevant and allows all coefficients of a generalized linear mixed model to be sampled in the same comprehensive framework.

Lecturer in the Discipline of Political Science, Columbia University

International Agency for Research on Cancer

I am a member of the R Core Team and co-president of the R Foundation. My main methodological interest is in Bayesian inference and my applied work is in cancer epidemiology.

Tuesday June 28, 2016 11:06am - 11:24am PDT

Barnes & McDowell & Cranston

R/qtl: Just Barely Sustainable

**Moderators**
## Benoit Liquet

**Speakers**
## Karl Broman

R/qtl is an R package for mapping quantitative trait loci (genetic loci that contribute to variation in quantitative traits, such as blood pressure) in experimental crosses (such as in mice). I began its development in 2000; there have been 46 software releases since 2001. The latest version contains 39k lines of R code, 24k lines of C code, and 16k lines of code for the documentation. The continued development and maintenance of the software has been challenging. I'll describe my experiences in developing and maintaining the package and in providing support to users. I'm currently working on a re-implementation of the package to better handle high-dimensional data and more complex experimental crosses. I'll describe my efforts to avoid repeating the mistakes I made the first time around.

University Pau et Pays de L'Adour, ACEMS: Centre of Excellence for Mathematical and Statistical Frontiers, QUT, Australia

Professor, University of Wisconsin-Madison

Karl Broman is Professor in the Department of Biostatistics & Medical Informatics at the University of Wisconsin–Madison; research in statistical genetics; developer of R/qtl. Recently he has been focusing on interactive data visualization; see his R/qtlcharts package and his D...

Tuesday June 28, 2016 11:06am - 11:24am PDT

Lane & Lyons & Lodato

R: The last line of defense against bad debt

**Moderators**
## Emilio L. Cano

**Speakers**

During the last decade, data has changed the behaviors of individuals and corporations alike. Among corporations, Advanced Analytics has gained considerable momentum – not only as a source of competitive advantage in the short term, but also because of the risk of becoming obsolete in the medium term. In this context, data scientists cannot offer solutions to data-rich problems without leveraging the opportunities of statistical learning with a tool like R, whose functional nature allows prototypes to be rapidly transformed into useful solutions. To be specific, we will focus on a particular and pressing issue across industries, geographies and organizations: collections & bad debt. We will show how machine learning algorithms leveraging R helped shape better solutions to a “millennial” problem (i.e. how am I getting paid back?). During this talk, we will show how, with the help of R as our main workhorse, we approach a collection problem from its inception to the actual business implementation. First, we will describe how we can preprocess the data that may be useful for the purpose of predicting which customers are going to fail to pay their bills. Then, we will explore the relationship between the past payment behavior of a customer and his ability to satisfy future obligations. Finally, we will conclude by briefly sharing how the output of a prediction model can be translated into effective business strategies, using a project we have recently been involved in as an example.

Researcher and Lecturer, Rey Juan Carlos University and the University of Castilla-La Mancha

Statistician, R enthusiast. Research topics: Statistical Process Control, Six Sigma methodology, Stochastic Optimization, energy market modelling.

Tuesday June 28, 2016 11:06am - 11:24am PDT

Econ 140

Taking R to new heights for scalability and performance

**Moderators**

**Speakers**
## Mark Hornick

Big Data is all the rage, but how can enterprises extract value from such large accumulations of data as found in the growing corporate “data lakes” or “data reservoirs”? The ability to extract value from big data demands high-performance, scalable tools – both in hardware and software. Increasingly, enterprises take on massive predictive modeling projects, where the goal is to build models on multi-billion row tables or build thousands or millions of models. Data scientists need to address use cases that range from modeling individual customer behavior to understand aggregate behavior or tailoring predictions at the individual customer level, to monitoring sensors from the Internet of Things for anomalous behavior. While R is cited as the most used statistical language, limitations of scalability and performance often restrict its use for big data. In this talk, we present scenarios both on Hadoop and database platforms using R. We illustrate how Oracle Advanced Analytics’ R Enterprise interface and Oracle R Advanced Analytics for Hadoop enable taking R to new heights for scalability and performance.

Senior Director, Oracle

Mark Hornick is the Senior Director of Product Management for the Oracle Machine Learning (OML) family of products. He leads the OML PM team and works closely with Product Development on product strategy, positioning, and evangelization. Mark has over 20 years of experience with integrating...

Tuesday June 28, 2016 11:06am - 11:24am PDT

SIEPR 130

trackeR: Infrastructure for running and cycling data from GPS-enabled tracking devices in R

**Moderators**
## Heather Turner

I'm a freelance statistical computing consultant providing support in R to people in a range of industries, but particularly the life sciences. My interests include statistical modelling, clustering, bioinformatics, reproducible research and graphics. I chair the core group of Forwards...

**Speakers**

The use of GPS-enabled tracking devices and heart rate monitors is becoming increasingly common in sports and fitness activities. The trackeR package aims to fill the gap between the routine collection of data from such devices and their analyses in a modern statistical environment like R. The package provides methods to read tracking data and store them in session-based, unit-aware, and operation-aware objects of class trackeRdata. The package also implements core infrastructure for relevant summaries and visualisations, as well as support for handling units of measurement. There are also methods for relevant analytic tools such as time spent in zones, work capacity above critical power (known as W'), and distribution and concentration profiles. A case study illustrates how the latter can be used to summarise the information from training sessions and use it in more advanced statistical analyses.

Freelance consultant

Tuesday June 28, 2016 11:06am - 11:24am PDT

McCaw Hall

Automating our work away: One consulting firm's experience with KnitR

**Moderators**
## Emilio L. Cano

Statistician, R enthusiast. Research topics: Statistical Process Control, Six Sigma methodology, Stochastic Optimization, energy market modelling.

**Speakers**

As consultants, we work on many projects that are similar, with many steps repeated verbatim across projects. Previously, our workflow was based largely in Microsoft Office, with our analysis done manually in Excel, our reports written in Word, and our presentations in PowerPoint. In 2015, we began using R for much of our analysis, including making slide decks and reports in R Markdown. Our presentation discusses why we made the change, how we managed it, and advice for other consulting firms looking to do the same.

Researcher and Lecturer, Rey Juan Carlos University and the University of Castilla-La Mancha

Tuesday June 28, 2016 11:24am - 11:42am PDT

Econ 140

Distributed Computing using parallel, Distributed R, and SparkR

**Moderators**

**Speakers**

Data volume is ever increasing, while single-node performance has stagnated. To scale, analysts need to distribute computations. R has built-in support for parallel computing, and third-party contributions, such as Distributed R and SparkR, enable distributed analysis. However, analyzing large data in R remains a challenge, because interfaces to distributed computing environments, like Spark, are low-level and non-idiomatic. The user is effectively coding for the underlying system, instead of writing natural and familiar R code that produces the same result across computing environments. This talk focuses on how to scale R-based analyses across multiple cores and to leverage distributed machine learning frameworks through the ddR (Distributed Data structures in R) package, a convenient, familiar, and idiomatic abstraction that helps to ensure portability and reproducibility of analyses. The ddR package defines a framework for implementing interfaces to distributed environments behind the canonical base R API. We will discuss key programming concepts and demonstrate writing simple machine learning applications. Participants will learn about creating parallel applications from scratch as well as invoking existing parallel implementations of popular algorithms, like random forest and k-means clustering.

Tuesday June 28, 2016 11:24am - 11:42am PDT

SIEPR 130

Fitting complex Bayesian models with R-INLA and MCMC

**Moderators**
## Ben Goodrich

Ben Goodrich is a core developer of Stan, which is a collection of statistical software for Bayesian estimation of models, and is the maintainer of the corresponding rstan and rstanarm R packages. He teaches in the political science department and in the Quantitative Methods in the...

**Speakers**

The Integrated Nested Laplace Approximation (INLA) provides a computationally efficient approach to obtaining an approximation to the posterior marginals for a large number of Bayesian models. In particular, INLA focuses on those models that can be expressed as a Latent Gaussian Markov Random field. Its associated R package, R-INLA, implements a number of functions to easily fit many of these models. However, it is not easy to implement new latent models or priors. Bivand et al. (2014) proposed a way of using R-INLA to fit models that are not implemented, by fixing some parameters in the model and then combining the fitted models using Bayesian Model Averaging (BMA). This is implemented in the INLABMA R package. An interesting feature of this approach is that it allows Bayesian models to be fitted in parallel. Recently, Gomez-Rubio et al. (2016) have proposed the use of MCMC and INLA together to fit more complex models. This approach allows INLA to fit models with unimplemented (or multivariate) priors, missing data in the covariates and many more latent models. Finally, we will explore how these ideas can be applied to fit models to Big Data. This involves fitting models to separate chunks of data with R-INLA and then combining the output to obtain an approximation to the model with all the data.

References: Bivand et al. (2014). Approximate Bayesian Inference for Spatial Econometrics Models. Spatial Statistics 9, 146-165.

Gomez-Rubio et al. (2016). Extending INLA with MCMC. Work in progress.

Lecturer in the Discipline of Political Science, Columbia University

Tuesday June 28, 2016 11:24am - 11:42am PDT

Barnes & McDowell & Cranston

Fry: A Fast Interactive Biological Pathway Miner

**Moderators**
## Benoit Liquet

University Pau et Pays de L'Adour, ACEMS: Centre of Excellence for Mathematical and Statistical Frontiers, QUT, Australia

**Speakers**

Gene set tests are often used in differential expression analyses to explore the behavior of a group of related genes. This is useful for identifying large-scale co-regulation of genes belonging to the same biological process or molecular pathway. One of the most flexible and powerful gene set tests is the ROAST method in the limma R package. ROAST uses residual space rotation as a sort of continuous version of sample permutation. Like permutation tests, it protects against false positives caused by correlations between genes in the set. Unlike permutation tests, it can be used with complex experimental design and with small numbers of replicates. It is the only gene set test method that is able to analyse complex “gene expression signatures” that incorporate information about both up and down regulated genes simultaneously. ROAST works well for individual expression signatures, but has limitations when applied to large collections of gene sets, such as the Broad Institute’s Molecular Signature Database with over 8000 gene sets. In particular, the p-value resolution is limited by the number of rotations that are done for each set. This makes it impossible to obtain very small p-values and hence to distinguish the top ranking pathways from a large collection. As with permutation tests, the p-values for each set may vary from run to run. This talk presents Fry, a very fast approximation to the complete ROAST method. Fry approximates the limiting p-value that would be obtained from performing a very large number of rotations with ROAST. Fry preserves most of the advantages of ROAST, but also provides high resolution exact p-values very quickly. In particular, it is able to distinguish the most significant sets in large collections and to yield statistically significant results after adjustment for multiple testing. This makes it an ideal tool for large-scale pathway analysis. 
Another important consideration in gene set tests is the possible biased or incorrect estimation of p-values due to the correlation among genes in the same set or the dependence structure between different sets.
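
The resolution limit mentioned in the abstract can be made concrete with a short calculation. The sketch below is in Python for illustration and rests on two assumptions not stated in the abstract: that a simulation-based p-value is computed with the common Monte Carlo convention (b + 1) / (B + 1), where b counts the rotations at least as extreme as the observed statistic out of B rotations, and that multiple testing is handled with a Bonferroni threshold. The function names are hypothetical.

```python
import math
from fractions import Fraction


def min_p_value(n_rotations):
    """Smallest p-value attainable with B rotations under (b + 1) / (B + 1)."""
    # Assumption: the best case is b = 0, i.e. no rotation beats the observed statistic.
    return Fraction(1, n_rotations + 1)


def rotations_needed(n_sets, alpha=Fraction(1, 20)):
    """Fewest rotations for which a set could pass a Bonferroni threshold alpha / n_sets."""
    threshold = alpha / n_sets
    # Solve 1 / (B + 1) <= threshold for the smallest integer B.
    return math.ceil(1 / threshold) - 1


# With 999 rotations no set can score below 1/1000, however strong its signal.
floor_999 = min_p_value(999)

# A collection of 8000 sets tested at alpha = 0.05 needs roughly 160,000 rotations per set.
needed = rotations_needed(8000)
```

Exact fractions are used instead of floats so that the boundary case, where the smallest attainable p-value equals the threshold, is not affected by rounding.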

Tuesday June 28, 2016 11:24am - 11:42am PDT

Lane & Lyons & Lodato

United Nations World Population Projections with R

**Moderators**
## Heather Turner

I'm a freelance statistical computing consultant providing support in R to people in a range of industries, but particularly the life sciences. My interests include statistical modelling, clustering, bioinformatics, reproducible research and graphics. I chair the core group of Forwards...

**Speakers**

Recently, the United Nations adopted a probabilistic approach to projecting fertility, mortality and population for all countries. In this approach, the total fertility and female and male life expectancy at birth are projected using Bayesian hierarchical models estimated via Markov Chain Monte Carlo. They are then combined yielding probabilistic projections for any population quantity of interest. The methodology is implemented in a suite of R packages which has been used by the UN to produce the most recent revision of the World Population Prospects. I will summarize the main ideas behind each of the packages, namely bayesTFR, bayesLife, bayesPop, bayesDem, and the shiny-based wppExplorer. I will also touch on our experience of the collaboration between academics and the UN.

Freelance consultant

Tuesday June 28, 2016 11:24am - 11:42am PDT

McCaw Hall

Analysis of big biological sequence datasets using the DECIPHER package

**Moderators**
## Benoit Liquet

University Pau et Pays de L'Adour, ACEMS: Centre of Excellence for Mathematical and Statistical Frontiers, QUT, Australia

**Speakers**

Recent advances in DNA sequencing have led to the generation of massive amounts of biological sequence data. As a result, there is an urgent need for packages that assist in organizing and evaluating large collections of sequences. The DECIPHER package enables the construction of databases for curating sequence sets in a space-efficient manner. Sequence databases offer improved organization and greatly reduce memory requirements by allowing subsets of sequences to be accessed independently. Using DECIPHER, sequences can be imported into a database, explored, viewed, and exported under non-destructive workflows that simplify complex analyses. For example, DECIPHER workflows could be used to quickly search for thousands of short sequences (oligonucleotides) within millions of longer sequences that are contained in a database. DECIPHER also includes state-of-the-art functions for sequence alignment, primer/probe design, sequence manipulation, phylogenetics, and other common bioinformatics tasks. Collectively, these features empower DECIPHER users to handle big biological sequence data using only a regular laptop computer.

Tuesday June 28, 2016 11:42am - 12:00pm PDT

Lane & Lyons & Lodato

bayesboot: An R package for easy Bayesian bootstrapping

**Moderators**
## Ben Goodrich

Ben Goodrich is a core developer of Stan, which is a collection of statistical software for Bayesian estimation of models, and is the maintainer of the corresponding rstan and rstanarm R packages. He teaches in the political science department and in the Quantitative Methods in the...

**Speakers**
## Rasmus Arnling Bååth

Introduced by Rubin in 1981, the Bayesian bootstrap is the Bayesian analogue to the classical non-parametric bootstrap and it shares the classical bootstrap's advantages: It is a non-parametric method that makes weak distributional assumptions and that can be used to calculate uncertainty intervals for any summary statistic. Therefore, it can be used as an inferential tool even when the data is not well described by standard distributions, for example, in A/B testing or in regression modeling. The Bayesian bootstrap can be seen as a smoother version of the classical bootstrap. But it is also possible to view the classical bootstrap as an approximation to the Bayesian bootstrap.

In this talk I will explain the model behind the Bayesian bootstrap, how it connects to the classical bootstrap and in what situations the Bayesian bootstrap is useful. I will also show how one can easily perform Bayesian bootstrap analyses in R using my package bayesboot (https://cran.r-project.org/package=bayesboot).
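
The idea behind the Bayesian bootstrap can be sketched in a few lines. The sketch below is in Python for illustration and does not use or reproduce the bayesboot package; the function names are hypothetical. It assumes the standard formulation in which each posterior draw reweights the observed data with weights from a flat Dirichlet(1, ..., 1) distribution, generated here by normalising Exponential(1) variates.

```python
import random


def bayesian_bootstrap_mean(data, n_draws=4000, seed=1):
    """Posterior draws of the mean under the Bayesian bootstrap (illustrative helper)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Normalised Exponential(1) variates are Dirichlet(1, ..., 1) weights.
        raw = [rng.expovariate(1.0) for _ in data]
        total = sum(raw)
        # Weighted mean of the observed data for this draw.
        draws.append(sum(w * x for w, x in zip(raw, data)) / total)
    return draws


def credible_interval(draws, level=0.95):
    """Equal-tailed interval read off the sorted posterior draws."""
    ordered = sorted(draws)
    lower = ordered[int((1 - level) / 2 * (len(ordered) - 1))]
    upper = ordered[int((1 + level) / 2 * (len(ordered) - 1))]
    return lower, upper


observations = [1.0, 2.0, 3.0, 4.0, 10.0]
posterior = bayesian_bootstrap_mean(observations)
interval = credible_interval(posterior)
```

The classical bootstrap corresponds to replacing the continuous Dirichlet weights with multinomial counts divided by the sample size, which is one way to see why the Bayesian version is described as smoother.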

Lecturer in the Discipline of Political Science, Columbia University

Data Scientist, King

I'm a data scientist at King interested in all things stats, but if it's Bayesian I'm especially interested.

Tuesday June 28, 2016 11:42am - 12:00pm PDT

Barnes & McDowell & Cranston

FlashR: Enable Parallel, Scalable Data Analysis in R

**Moderators**

**Speakers**

In the era of big data, R is rapidly becoming one of the most popular tools for data analysis. But the R framework is relatively slow and unable to scale to large datasets. The general approach to speeding up an implementation in R is to implement the algorithms in C or FORTRAN and provide an R wrapper. There are many works that parallelize R and scale it to large datasets. For example, Revolution R Open parallelizes a limited set of matrix operations individually, which limits its performance. Others such as Rmpi and R-Hadoop expose a low-level programming interface to R users and require more explicit parallelization. It is challenging to provide a framework that has a high-level programming interface while achieving efficiency. FlashR is a matrix-oriented R programming framework that supports automatic parallelization and out-of-core execution for large datasets. FlashR reimplements matrix operations in the R base package and provides some generalized matrix operations to improve expressiveness. FlashR automatically fuses matrix operations to reduce data movement between CPU and disks. We implement machine learning algorithms such as k-means and GMM in FlashR to benchmark its performance. On a large parallel machine, both in-memory and out-of-core execution of these R implementations in FlashR significantly outperform the ones in Spark MLlib. We believe FlashR significantly lowers the expertise required for writing parallel and scalable implementations of machine learning algorithms and provides new opportunities for large-scale machine learning in R. FlashR is implemented as an R package and is released as open source (http://flashx.io/).

Tuesday June 28, 2016 11:42am - 12:00pm PDT

SIEPR 130

How can I get everyone else in my organisation to love R as much as I do?

**Moderators**
## Emilio L. Cano

Statistician, R enthusiast. Research topics: Statistical Process Control, Six Sigma methodology, Stochastic Optimization, energy market modelling.

**Speakers**

Learning R is dangerous. It entices us in by presenting an incredibly powerful tool to solve our particular problem; for free! And as we learn how to do that, we uncover more things that make our solution even better. But then we start to look around our organisation or institution and see how it could make everyone's lives better too. And that's the dangerous part; R's got us hooked and we can't give up the belief that everyone else should be using this, right now. Even though R is free, open source software, there are often barriers to introducing it organisation-wide. These can arise from IT or quality policies, the need for management buy-in, or perceptions about learning the language. This presentation will first discuss the aspects required to understand these barriers to entry, and the different types of resolution for these. It will then use three projects to show how, by understanding the requirements of the organisation and developing situation-specific roll-out strategies, these barriers to entry can be overcome. The first example is a large organisation that wanted to quickly (within 6 weeks) show management how Shiny could improve information dissemination. As server policies made a proof of concept difficult to run internally, this project used a cloud-hosted environment for R, Shiny and a source database. The second example is around two SMEs who required access to a validated version of R, which was provided via the Amazon and Azure marketplaces. The key aspect of these projects is the value to IT departments of being able to distribute a pre-configured machine around the organisation.

Researcher and Lecturer, Rey Juan Carlos University and the University of Castilla-La Mancha

Tuesday June 28, 2016 11:42am - 12:00pm PDT

Econ 140

jailbreakr: Get out of Excel, free

**Moderators**
## Heather Turner

I'm a freelance statistical computing consultant providing support in R to people in a range of industries, but particularly the life sciences. My interests include statistical modelling, clustering, bioinformatics, reproducible research and graphics. I chair the core group of Forwards...

**Speakers**

One out of every ten people on the planet uses a spreadsheet and about half of those use formulas: "Let's not kid ourselves: the most widely used piece of software for statistics is Excel." (Ripley, 2002) Those of us who script analyses are in the distinct minority! There are several effective packages for importing spreadsheet data into R. But, broadly speaking, they prioritize access to [a] data and [b] data that lives in a neat rectangle. In our collaborative analytical work, we battle spreadsheets created by people who did not get this memo. We see messy sheets, with multiple data regions sprinkled around, mixed with computed results and figures. Data regions can be a blend of actual data and, e.g., derived columns that are computed from other columns. We will present our work on extracting tricky data and formula logic out of spreadsheets. To what extent can data tables be automatically identified and extracted? Can we identify columns that are derived from others in a wholesale fashion and translate that into something useful on the R side? The goal is to create a more porous border between R and spreadsheets. Target audiences include novices transitioning from spreadsheets to R and experienced useRs who are dealing with challenging sheets.

Freelance consultant

Tuesday June 28, 2016 11:42am - 12:00pm PDT

McCaw Hall

Capturing and understanding patterns in plant genetic resource data to develop climate change adaptive crops using the R platform

**Moderators**

**Speakers**
## Abdallah Bari

Genetic resources consist of genes and genotypes with patterns reflecting their dynamic adaptation to changing environmental conditions. Detailed understanding of these patterns will significantly enhance the potential of developing crops with traits adaptive to climate change. Genetic resources have contributed in the past to an increase of about 50 percent in crop yields through genetic improvements; further improvement and development of climate change resilient crops will largely depend on these natural resources. However, the datasets associated with these resources are very large and consist mostly of records of single observations and/or continuous functions with limited information on key variables. Analysis of such complex and large datasets requires new mathematical conceptual frameworks, and a flexible evolving platform for a timely and continuous utilization of these resources to accelerate the identification of genetic material or genes that could be used for improving the resilience of food crops to climate change. In this global collaborative research and during the development of the theoretical framework, numerous modelling routines have been tested, including linear and nonlinear approaches on the R platform. The results were validated and used for the identification of sources of important traits such as drought, salinity and heat tolerance. This paper presents the conceptual framework with applications in R used in the identification of crop germplasm with climate change adaptive traits. The paper addresses the dynamics as well as the specificity of genetic resources data, which consists not only of records of mostly single observations but also functional data.

Researcher, Data and Image Analytics - Montreal

Abdallah Bari is a researcher focusing on applied mathematics in research. He received his PhD in imaging techniques to assess genetic variation from the University of Cordoba, Spain. His research involves elaborating and applying mathematical models and theoretical aspects to seek...

Tuesday June 28, 2016 1:00pm - 1:18pm PDT

SIEPR 120

Continuous Integration and Teaching Statistical Computing with R

**Moderators**

**Speakers**

In this talk we will discuss two statistical computing courses taught as part of the undergraduate and masters curriculum in the Department of Statistical Science at Duke University. The primary goal of these courses is to teach advanced R along with modern software development practices. In this talk we will focus in particular on our adoption of continuous integration tools (GitHub and Wercker) as a way to automate and improve the feedback cycle for students as they work on their assignments. Overall, we have found that these tools, when used appropriately, help reduce learner frustration, improve code quality, reduce instructor workload, and introduce powerful tools that are relevant long after the completion of the course. We will discuss several of the classes' open-ended assignments and explore instances where continuous integration made sense as well as cases where it did not.

Tuesday June 28, 2016 1:00pm - 1:18pm PDT

SIEPR 130

Group and sparse group partial least squares approaches applied in a genomics context

**Moderators**

**Speakers**
## Benoit Liquet

University Pau et Pays de L'Adour, ACEMS: Centre of Excellence for Mathematical and Statistical Frontiers, QUT, Australia

In this talk, I will concentrate on a class of multivariate statistical methods called Partial Least Squares (PLS). They are used for analysing the association between two blocks of ‘omics’ data, which bring challenging issues in computational biology due to their size and complexity. In this framework, we will exploit the knowledge on the grouping structure existing in the data, which is key to more accurate prediction and improved interpretability. For example, genes within the same pathway have similar functions and act together in regulating a biological system. In this context, we developed a group Partial Least Squares (gPLS) method and a sparse gPLS (sgPLS) method. Our methods, available through our sgPLS R package, are compared using data from an HIV therapeutic vaccine trial. Our approaches provide parsimonious models to reveal the relationship between gene abundance and the immunological response to the vaccine.

Tuesday June 28, 2016 1:00pm - 1:18pm PDT

Barnes & McDowell & Cranston

Linking htmlwidgets with crosstalk and mobservable

**Moderators**
## Torben Tvedebrink

**Speakers**

The htmlwidgets package makes it easy to create interactive JavaScript widgets from R, and display them from the R console or insert them into R Markdown documents and Shiny apps. These widgets exhibit interactivity "in the small": they can interact with mouse clicks and other user gestures within their widget boundaries. This talk will focus on interactivity "in the large", where interacting with one widget results in coordinated changes in other widgets (for example, select some points in one widget and the corresponding observations are instantly highlighted across the other widgets). This kind of inter-widget interactivity can be achieved by writing a Shiny app to coordinate multiple widgets (and indeed, this is a common way to use htmlwidgets). But some situations call for a more lightweight solution. crosstalk and robservable are two distinct but complementary approaches to the problem of widget coordination, authored by myself and Ramnath Vaidyanathan, respectively. Each augments htmlwidgets with pure-JavaScript coordination logic; neither requires Shiny (or indeed any runtime server support at all). The resulting documents can be hosted on GitHub, RPubs, Amazon S3, or any static web host. In this talk, I'll demonstrate these new tools, and discuss their advantages and limitations compared to existing approaches.

Associate Professor, Department of Mathematical Sciences, Aalborg University

I'm a statistician working in the area of forensic genetics. R is a core part of my teaching and research. I was the chair of the local organising committee for useR! 2015 in Aalborg, and part of the programme committee for useR! 2016 in Stanford.

Tuesday June 28, 2016 1:00pm - 1:18pm PDT

McCaw Hall

Experiences on the Use of R in the Water Sector

**Moderators**

**Speakers**
## David Ibarra

In this study we present some real cases where R has been a key element in building decision support systems for the water industry. We have used R for automatic water demand forecasting, for its application to optimal pumping scheduling, and for building a framework that offers these algorithms as a service (using RInside, Rcpp, MPI and RProtobuf, among others) so that our work integrates easily into heterogeneous environments. We have used an HPC cluster with R to solve big problems faster. For water demand forecasting we used several tools, such as linear models, neural networks and tree-based methods. For short-term forecasts we also included weather forecast variables. The method is selected dynamically (online) using recent out-of-sample data. The optimal pumping schedule model is loaded with the lpSolveAPI package and solved with CBC. We produce nice HTML5 reports of the solutions using the googleVis package.

Senior Data Scientist. Leading Data Science Area at Business Analytics, Aqualogy Business Software

Currently I'm a PhD candidate at the University of Alicante in Computer Science (High Performance Computing, Parallel programming) and I'm also a full-time Senior Data Scientist at Aqualogy Business Software.
During 2016 I became a Senior Data Scientist at VidaCaixa (Life Insurance... Read More →

Tuesday June 28, 2016 1:18pm - 1:36pm PDT

SIEPR 120

Fast additive quantile regression in R

**Moderators**

**Speakers**

Quantile regression represents a flexible approach for modelling the impact of several covariates on the conditional distribution of the dependent variable, which does not require making any parametric assumption on the observations' density. However, fitting quantile regression models using the traditional pinball loss is computationally expensive, due to the non-differentiability of this function. In addition, if this loss is used, extending quantile regression to the context of non-parametric additive models becomes difficult. In this talk we will describe how the computational burden can be reduced by approximating the pinball loss with a differentiable function. This allows us to exploit the computationally efficient approach described by [1], and implemented by the mgcv R package, to fit smooth additive quantile models. Besides this, we will show how the smoothing parameters can be selected in a robust fashion, and how reliable uncertainty estimates can be obtained, even for extreme quantiles. We will demonstrate this approach, which is implemented by an upcoming extension of mgcv, in the context of probabilistic forecasting of electricity demand. [1] Wood, S. N., N. Pya, and B. Safken (2015). Smoothing parameter and model selection for general smooth models. http://arxiv.org/abs/1511.03864
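To make the smoothing idea in this abstract concrete, here is a minimal sketch in Python, used purely for illustration (it is not the mgcv extension's code). It compares the pinball loss with one possible differentiable surrogate, a softplus-based approximation controlled by a hypothetical bandwidth `lam`. The abstract does not say which smooth function the authors use, so this particular surrogate and all names below are assumptions.

```python
import math


def pinball_loss(u, tau):
    """Pinball (check) loss for a residual u = y - q at quantile level tau."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def smooth_pinball_loss(u, tau, lam):
    """Assumed surrogate: tau*u + lam*log(1 + exp(-u/lam)).

    As lam -> 0 it tends to the pinball loss, but it is differentiable at u = 0.
    """
    return tau * u + lam * softplus(-u / lam)


def smooth_pinball_grad(u, tau, lam):
    """Derivative of the surrogate with respect to u, defined for every u."""
    return tau - sigmoid(-u / lam)


if __name__ == "__main__":
    # Compare the two losses on a few residuals for the 0.9 quantile.
    for u in (-2.0, -0.1, 0.0, 0.1, 2.0):
        print(u, pinball_loss(u, 0.9), smooth_pinball_loss(u, 0.9, 0.05))
```

The gradient of the surrogate moves continuously from `tau - 1` to `tau` as the residual crosses zero, which is what lets gradient-based fitting methods be applied.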

Tuesday June 28, 2016 1:18pm - 1:36pm PDT

Barnes & McDowell & Cranston

htmlwidgets: Power of JavaScript in R

**Moderators**
## Ioannis Kosmidis

**Speakers**

htmlwidgets is an R package that provides a comprehensive framework to create interactive JavaScript-based widgets for use from R. Once created, these widgets can be used at the R console, embedded in an R Markdown report, or even used inside a Shiny web application. In this talk, I will introduce the concept of an "htmlwidget", and discuss how to create, develop and publish a new widget from scratch. I will discuss how multiple widgets can be composed to create dashboards and interactive reports. Finally, I will touch upon more advanced functionality like auto-resizing and post-render callbacks, and briefly discuss some of the exciting developments in this area. There has been significant interest in the R community in bringing more interactivity into visualizations, reports and applications. The htmlwidgets package is an attempt to simplify the process of developing interactive widgets, and publishing them for more widespread usage in the R community.

Associate Professor, Department of Statistical Science, University College London

I am a Senior Lecturer at the Department of Statistical Science in University College London. My theoretical and methodological research focuses on optimal estimation and inference from complex statistical models, penalized likelihood methods and clustering. A particular focus of... Read More →

Tuesday June 28, 2016 1:18pm - 1:36pm PDT

Lane & Lyons & Lodato

Integrated R labs for high school students

The Mobilize project developed a year-long high school level Introduction to Data Science course, which has been piloted in 27 public schools in the Los Angeles Unified School District. The curriculum is innovative in many ways, including the use of R and the associated curricular support materials. Broadly, there are three main approaches to teaching R. One has users learning to code in their browser (Code School and DataCamp), another has them working directly in the R console (swirl), and a final approach is to have students follow along with an external document (OpenIntro). The integrated R labs developed by Mobilize bridge between working at the console and following an instructional document. Through the mobilizr package, students can load labs written to accompany the course directly into the Viewer pane in RStudio, allowing them to work through material without ever leaving RStudio. By providing the labs as part of the curricular materials we reduce the burden on teachers and allow students to work at their own pace. We will discuss the functionality of the labs as they stand, as well as developments in the .Rpres format that could allow for even more interactive learning.

Tuesday June 28, 2016 1:18pm - 1:36pm PDT

SIEPR 130

Transforming a museum to be data-driven using R

**Moderators**
## Torben Tvedebrink

**Speakers**

With the exponential growth of data, more and more businesses are demanding to become data-driven. As they seek value from their data, big data and data science initiatives, jobs and skill sets have risen up the business agenda. R, being a data scientist's best friend, plays an important role in this transformation. But how do you transform a traditionally un-data-orientated business into being data-driven, armed with R, data science processes and plenty of enthusiasm?

The first data scientist at a museum shares her experience on the journey to transform the 250-year-old British Museum to be data-driven by 2018. How is one of the most popular museums in the world, with 6.8 million annual visitors, using R to achieve a data-driven transition?

- Data wrangling
- Exploring data to make informed decisions
- Winning stakeholders' support with data visualisations and dashboard
- Predictive modelling
- Future uses including internet of things, machine learning etc.

Using R and data science, any organisation can become data driven. With data and analytical skills demand higher than supply, more businesses need to know that R is part of the solution and that R is a great language to learn for individuals wanting to get into data science.

Associate Professor, Department of Mathematical Sciences, Aalborg University

I'm a statistician working in the area of forensic genetics. R is a core part of my teaching and research. I was the chair of the local organising committee for useR! 2015 in Aalborg, and part of the programme committee for useR! 2016 in Stanford.

Tuesday June 28, 2016 1:18pm - 1:36pm PDT

McCaw Hall

A Case Study in Reproducible Model Building: Simulating Groundwater Flow in the Wood River Valley Aquifer System, Idaho

**Moderators**

**Speakers**

The goal of reproducible model building is to tie processing instructions to data analysis so that the model can be recreated, better understood, and easily modified to incorporate new field measurements and (or) explore alternative system and boundary conceptualizations. Reproducibility requires archiving and documenting all raw data and source code used to pre- and post-process the model; an undertaking made easier by the advances in open source software, open file formats, and cloud computing. Using a software development methodology, a highly reproducible model of groundwater flow in the Wood River Valley (WRV) aquifer system was built. The collection of raw data, source code, and processing instructions used to build and analyze the model was placed in an R package. An R package allows for easy, transparent, and cross-platform distribution of its content by enforcing a set of formal format standards. R largely facilitates reproducible research with the package vignette, a document that combines content and data analysis source code. The code is run when the vignette is built, and all data analysis output (such as figures and tables) is created extemporaneously and inserted into the final document. The R package created for the WRV groundwater-flow model includes multiple vignettes that explain and run all processing steps; the exception to this being the parameter estimation process, which was not made programmatically reproducible. MODFLOW-USG, the numerical groundwater model used in this case study, is executed from a vignette, and model output is returned for exploratory analyses.

Tuesday June 28, 2016 1:36pm - 1:54pm PDT

SIEPR 120

bigKRLS: Optimizing non-parametric regression in R

Data scientists are increasingly interested in modeling techniques involving relatively few parametric assumptions, particularly when analyzing large or complex datasets. Though many approaches have been proposed for this situation, Hainmueller and Hazlett's (2014) Kernel-regularized Least Squares (KRLS) offers statistical and interpretive properties that are attractive for theory development and testing. KRLS allows researchers to estimate the average marginal effect (the slope) of an explanatory variable but (unlike parametric regression techniques, whether classical or Bayesian) without the requirement that researchers know the functional form of the data generating process in advance. In conjunction with Tikhonov regularization (which prevents overfitting), KRLS offers researchers the ability to investigate heterogeneous causal effects in a reasonably robust fashion. Further, KRLS estimates offer researchers several avenues to investigate how those effects depend on other observable, explanatory variables. We introduce bigKRLS, which markedly improves memory management over the existing R package, which is key since RAM usage is proportional to the number of observations squared. In addition, we allow users to parallelize key routines (with the snow library) and shift matrix algebra operations to a distributed platform if desired (with bigmemory and bigalgebra). As an example, we estimate a model from a voter turnout experiment. The results show how the effects of a randomized treatment (here, a get-out-the-vote message) depend on other variables. Finally, we briefly discuss which post-estimation quantities of interest will help users determine whether they have a sufficiently large sample size for the asymptotics on which KRLS relies.
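The memory point in this abstract follows from the shape of the estimator. As a rough illustration only (plain Python, not the bigKRLS or KRLS API), the sketch below fits a kernel-regularized least squares model by building the N x N Gaussian kernel matrix `K` and solving `(K + lam * I) c = y`. The function names, the bandwidth `sigma2` and the penalty `lam` are hypothetical choices made for this example.

```python
import math


def gaussian_kernel_matrix(X, sigma2):
    """N x N Gaussian kernel; this matrix is why memory grows with N squared."""
    n = len(X)
    return [
        [math.exp(-sum((a - b) ** 2 for a, b in zip(X[i], X[j])) / sigma2)
         for j in range(n)]
        for i in range(n)
    ]


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def krls_fit(X, y, lam, sigma2):
    """Return the kernel matrix K and coefficients c solving (K + lam*I) c = y."""
    K = gaussian_kernel_matrix(X, sigma2)
    n = len(y)
    A = [[K[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    return K, solve_linear_system(A, y)


def fitted_values(K, c):
    """Fitted values K c at the training points."""
    return [sum(k_ij * c_j for k_ij, c_j in zip(row, c)) for row in K]


if __name__ == "__main__":
    X = [[0.0], [1.0], [2.0], [3.0]]  # toy data, one explanatory variable
    y = [0.0, 1.0, 4.0, 9.0]
    K, c = krls_fit(X, y, lam=0.1, sigma2=1.0)
    print(fitted_values(K, c))
```

Doubling the number of observations quadruples the size of `K`, which is the scaling the abstract refers to.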

Tuesday June 28, 2016 1:36pm - 1:54pm PDT

Barnes & McDowell & Cranston

CVXR: An R Package for Modeling Convex Optimization Problems

**Moderators**
## Torben Tvedebrink

**Speakers**
## Anqi Fu

CVXR is an R package that provides an object-oriented modeling language for convex optimization. It allows the user to formulate convex optimization problems in a natural mathematical syntax rather than the restrictive standard form required by most solvers. The user specifies an objective and set of constraints by combining constants, variables, and parameters using a library of functions with known curvature and monotonicity properties. CVXR then applies signed disciplined convex programming (DCP) to verify the problem's convexity and, once verified, converts the problem into a standard conic form using graph implementations and passes it to an open-source cone solver such as ECOS or SCS. We demonstrate CVXR's modeling framework with several applications.
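As a rough illustration of the kind of curvature bookkeeping that disciplined convex programming relies on, the Python sketch below tracks curvature labels through sums and scalar multiples. It covers only a small subset of the DCP rules (no composition or monotonicity rules), and none of the names correspond to CVXR's actual API; they are hypothetical.

```python
AFFINE, CONVEX, CONCAVE, UNKNOWN = "affine", "convex", "concave", "unknown"


def negate(curvature):
    """Negation swaps convex and concave; affine stays affine."""
    return {AFFINE: AFFINE, CONVEX: CONCAVE,
            CONCAVE: CONVEX, UNKNOWN: UNKNOWN}[curvature]


def add(left, right):
    """Curvature of a sum: affine is neutral, mixed curvature is unknown."""
    if UNKNOWN in (left, right):
        return UNKNOWN
    if left == AFFINE:
        return right
    if right == AFFINE:
        return left
    return left if left == right else UNKNOWN


def scale(k, curvature):
    """Curvature of k * expr for a real constant k."""
    if k == 0:
        return AFFINE  # the zero expression is constant, hence affine
    return curvature if k > 0 else negate(curvature)


def is_valid_minimization_objective(curvature):
    """A minimization objective must be verifiably convex (affine included)."""
    return curvature in (AFFINE, CONVEX)


if __name__ == "__main__":
    # Example: a convex term minus twice a concave term, plus an affine term.
    objective = add(add(CONVEX, scale(-2.0, CONCAVE)), AFFINE)
    print(objective, is_valid_minimization_objective(objective))
```

An expression whose curvature comes out as unknown is rejected even if it happens to be convex, which is why DCP asks users to build problems from functions with known properties.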

Associate Professor, Department of Mathematical Sciences, Aalborg University

I'm a statistician working in the area of forensic genetics. R is a core part of my teaching and research. I was the chair of the local organising committee for useR! 2015 in Aalborg, and part of the programme committee for useR! 2016 in Stanford.

Life Science Research Professional, Stanford University

I am a Life Science Research Professional working with Dr. Stephen Boyd and Dr. Lei Xing on applications of convex optimization to radiation treatment planning. Prior to here, I was a Machine Learning Scientist at H2O.ai, developing and testing large-scale, distributed algorithms... Read More →

Tuesday June 28, 2016 1:36pm - 1:54pm PDT

McCaw Hall

Introducing Statistics with intRo

intRo is a modern web-based application for performing basic data analysis and statistical routines as well as an accompanying R package. Leveraging the power of R and Shiny, intRo implements common statistical functions in a powerful and extensible modular structure, while remaining simple enough for the novice statistician. This simplicity lends itself to a natural presentation in an introductory statistics course as a substitute for other commonly used statistical software packages, such as Excel and JMP. intRo is currently deployed at the URL http://www.intro-stats.com. In this talk, we describe the underlying design and functionality of intRo, including its extensible modular structure, illustrate its use with a live demo, and discuss future improvements that will enable a wider adoption of intRo in introductory statistics courses.

Tuesday June 28, 2016 1:36pm - 1:54pm PDT

SIEPR 130

What can R learn from Julia

**Moderators**
## Ioannis Kosmidis

**Speakers**

Julia, like R, is a dynamic language for scientific computing but, unlike R, it was explicitly designed to deliver performance competitive with traditional batch-compiled languages. To achieve this, Julia's designers made a number of unusual choices, including the presence of a set of type annotations that are used for dispatching methods and speeding up code, but not for type-checking. The result is that many Julia programs are competitive with equivalent programs written in C. This talk gives a brief overview of the key points of Julia's design and considers whether similar ideas could be adopted in R.

Associate Professor, Department of Statistical Science, University College London

I am a Senior Lecturer at the Department of Statistical Science in University College London. My theoretical and methodological research focuses on optimal estimation and inference from complex statistical models, penalized likelihood methods and clustering. A particular focus of... Read More →

Tuesday June 28, 2016 1:36pm - 1:54pm PDT

Lane & Lyons & Lodato

A first-year undergraduate data science course

In this talk we will discuss an R based first-year undergraduate data science course taught at Duke University for an audience of students with little to no computing or statistical background. The course focuses on data wrangling and munging, exploratory data analysis, data visualization, and effective communication. The course is designed to be a first course in statistics for students interested in pursuing a quantitative major. Unlike most traditional introductory statistics courses, this course approaches statistics from a model-based, instead of an inference-based, perspective, and introduces simulation-based inference and Bayesian inference later in the course. A heavy emphasis is placed on reproducibility (with R Markdown) and version control and collaboration (with git/GitHub). We will discuss in detail course structure, logistics, and pedagogical considerations as well as give examples from the case studies used in the course. We will also share student feedback and assessment of the success of the course in recruiting students to the statistical science major.

Tuesday June 28, 2016 1:54pm - 2:12pm PDT

SIEPR 130

Modeling Food Policy Decision Analysis with an Interactive Bayesian Network in Shiny

**Moderators**

**Speakers**
## Rachel Lynne Wilkerson

The efficacy of policy interventions for socioeconomic challenges, like food insecurity, is difficult to measure due to a limited understanding of the complex web of causes and consequences. As an additional complication, limited data is available for accurate modeling. Thorough risk-based decision making requires appropriate statistical inference and a combination of data sources. The federal summer meals program is a part of the safety net for food-insecure families in the US, though the operations of the program itself are subject to risk. These uncertainties stem from variables relating both to internal operations and to the external food environment. Local partners often incur risk in operating the program; thus we use decision analysis to minimize the risks. After integrating public, private, and government data sources to create an innovative repository focused on the operations of the child nutrition programs, we construct a Bayesian network of variables that determine a successful program and compute the expected utility. Through an expected utility analysis, we can identify the key factors in minimizing the risk of program operations. This allows us to optimize the possible policy interventions, offering community advocates a data-driven approach to prioritizing possible programmatic changes. This work represents substantial progress towards innovative use of government data as well as a novel application of Bayesian networks to public policy. The mathematical modeling is also supplemented by a community-facing application developed in Shiny that aims to educate local partners about evidence-based decision making for the program operations.

Baylor University

I'm excited about civic uses of data, Bayesian analysis, and new uses of Shiny.
Slides from my talk: http://rpubs.com/rachwhatsit/useR2016

Tuesday June 28, 2016 1:54pm - 2:12pm PDT

SIEPR 120

Multiple Hurdle Tobit models in R: The mhurdle package

mhurdle is a package for R enabling the estimation of a wide set of regression models where the dependent variable is left censored at zero, which is typically the case in household expenditure surveys. These models are of particular interest to explain the presence of a large proportion of zero observations for the dependent variable by means of up to three censoring mechanisms, called hurdles. For the analysis of censored household expenditure data, these hurdles express a good selection mechanism, a desired consumption mechanism and a purchasing mechanism, respectively. However, the practical scope of these paradigmatic hurdles is not restricted to empirical demand analysis, as they have been fruitfully used in other fields of economics, including labor economics and contingent valuation. For each of these censoring mechanisms, a continuous latent variable is defined, indicating that censoring is in effect when the latent variable is negative. Latent variables are modeled as the sum of a linear function of explanatory variables and of a normal random disturbance, with a possible correlation between the disturbances of different latent variables. To model possible departures of the observed dependent variable from normality, we use flexible transformations allowing rescaling of skewed or leptokurtic random variables to heteroscedastic normality. mhurdle models are estimated using the maximum likelihood method for random samples. Model evaluation and selection are tackled by means of goodness of fit measures and Vuong tests. Real-world illustrations of the estimation of multiple hurdles models are provided using data from consumer expenditure surveys.

Tuesday June 28, 2016 1:54pm - 2:12pm PDT

Barnes & McDowell & Cranston

Statistics and R in Forensic Genetics

**Moderators**
## Torben Tvedebrink

I'm a statistician working in the area of forensic genetics. R is a core part of my teaching and research. I was the chair of the local organising committee for useR! 2015 in Aalborg, and part of the programme committee for useR! 2016 in Stanford.

**Speakers**
## Mikkel Meyer Andersen

Genetic evidence is often used as evidence in disputes. Mostly, the genetic evidence is DNA profiles and the disputes are often familial or crime cases. In this talk, we go through the statistical framework of evaluating genetic evidence by calculating an evidential weight. The focus will be the statistical aspects of how DNA material from the male Y chromosome can help resolve sexual assault cases. In particular, we show how an evidential weight of Y chromosomal DNA can be calculated using various statistical methods and how the methods use statistics and R. One of the methods is the discrete Laplace method, which is a statistical model consisting of a mixture of discrete Laplace distributions (an exponential family). We demonstrate how inference for that method was initially done using R's built-in glm function with a new family function for the discrete Laplace distribution. We also explain how inference was sped up by recognising the model as a weighted two-way layout with implicit model matrix and how this was implemented as a special case of iteratively reweighted least squares.
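For readers unfamiliar with the distribution named in this abstract, the following Python sketch evaluates the discrete Laplace probability mass function, `(1 - p) / (1 + p) * p ** abs(x - mu)`, and a simple mixture of such distributions for a single integer-valued marker. It is an illustration only: the single-marker setup, the function names and the parameter values are assumptions, and the glm and reweighted least squares fitting described in the talk is not shown.

```python
def discrete_laplace_pmf(x, p, mu=0):
    """P(X = x) for a discrete Laplace distribution on the integers.

    p in (0, 1) controls the spread; mu is the integer location parameter.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return (1.0 - p) / (1.0 + p) * p ** abs(x - mu)


def mixture_pmf(x, weights, ps, mus):
    """P(X = x) under a mixture of discrete Laplace components."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(w * discrete_laplace_pmf(x, p, mu)
               for w, p, mu in zip(weights, ps, mus))


if __name__ == "__main__":
    # Hypothetical two-component mixture centred on allele values 13 and 16.
    weights, ps, mus = [0.6, 0.4], [0.2, 0.3], [13, 16]
    for allele in range(11, 19):
        print(allele, mixture_pmf(allele, weights, ps, mus))
```

The probability falls off geometrically with the distance from the component's centre, so values far from every centre receive very little mass.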

Associate Professor, Department of Mathematical Sciences, Aalborg University

Assistant Professor, Department of Mathematical Sciences

I'm an applied statistician working with statistics for forensic genetics. R (as well as other programming languages and technologies) is a core part of my teaching and research. I was part of the local organising committee for useR! 2015 in Aalborg. useR! 2016 is my 3rd useR! conference... Read More →

Tuesday June 28, 2016 1:54pm - 2:12pm PDT

McCaw Hall

Zero-overhead integration of R, JS, Ruby and C/C++

**Moderators**
## Ioannis Kosmidis

**Speakers**

R is very powerful and flexible, but certain tasks are best solved by using R in combination with other programming languages. GNU R includes APIs to talk to some languages, e.g., Fortran and C/C++, and there are interfaces to other languages provided by various packages, e.g., Java and JS. All these interfaces incur significant overhead in terms of performance, usability, maintainability and overall system complexity. This is caused, to a large degree, by the different execution strategies employed by different languages, e.g., compiled vs. interpreted, and by incompatible internal data representations. The Truffle framework addresses these issues at a very fundamental level, and builds the necessary polyglot primitives directly into the runtime. Consequently, the FastR project, whose goal is to deliver an alternative but fully-compatible R runtime, leverages this infrastructure to allow multiple languages to interact transparently and seamlessly. All parts of a polyglot application can be compiled by the same optimizing compiler, called Graal, and can be executed and debugged simultaneously, with little to no overhead at the language boundary. This talk introduces FastR and the basic concepts driving Truffle’s transparent interoperability, along with a demo of the polyglot capabilities of the FastR runtime.

Associate Professor, Department of Statistical Science, University College London

I am a Senior Lecturer at the Department of Statistical Science in University College London. My theoretical and methodological research focuses on optimal estimation and inference from complex statistical models, penalized likelihood methods and clustering. A particular focus of... Read More →

Tuesday June 28, 2016 1:54pm - 2:12pm PDT

Lane & Lyons & Lodato

Detection of Differential Item Functioning with difNLR function

**Moderators**

**Speakers**
## Patrícia Martinková

In this work we present a new method for detection of Differential Item Functioning (DIF) based on Non-Linear Regression. Detection of DIF has been considered one of the most important topics in measurement and is implemented within the packages difR, lordif and others. Procedures based on Logistic Regression are among the most popular in the field; however, they do not take into account the possibility of guessing or the probability of carelessness, which are to be expected in multiple-choice tests or in patient-reported outcome measures. Methods based on Item Response Theory (IRT) models can account for guessing or for carelessness/inattention, but these latent models may be harder to explain to a general audience. We present an extension of the Logistic Regression procedure that includes the probability of guessing and the probability of carelessness. This general method based on a Non-Linear Regression (NLR) model is used for estimation of the Item Response Function and for detection of uniform and non-uniform DIF in dichotomous items. A simulation study suggests that the NLR method outperforms or is comparable with the LR-based or IRT-based methods. The new difNLR function provides a nice graphical output and is presented as part of the Shiny application ShinyItemAnalysis, which is available online.
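One common way to add guessing and carelessness to a logistic item response curve is to give it a lower asymptote `c` and an upper asymptote `d`. The Python sketch below shows that form, with group-specific shifts in discrimination and difficulty standing in for non-uniform and uniform DIF. It is a sketch under assumptions: the abstract does not spell out the exact parametrization used by difNLR, and every function and parameter name here is hypothetical.

```python
import math


def logistic(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def item_probability(score, a, b, c=0.0, d=1.0):
    """Probability of a correct answer given a (standardized) total score.

    a: discrimination, b: difficulty,
    c: lower asymptote (guessing), d: upper asymptote (1 - carelessness).
    With c = 0 and d = 1 this reduces to ordinary logistic regression.
    """
    return c + (d - c) * logistic(a * (score - b))


def item_probability_by_group(score, group, a, b, c, d, a_dif=0.0, b_dif=0.0):
    """Same curve with group-specific shifts (group is 0 or 1).

    b_dif != 0 with a_dif == 0 mimics uniform DIF; a_dif != 0 mimics
    non-uniform DIF.
    """
    return item_probability(score, a + a_dif * group, b + b_dif * group, c, d)


if __name__ == "__main__":
    # Hypothetical item: 20% guessing, 5% carelessness, harder for group 1.
    for score in (-2.0, 0.0, 2.0):
        p_ref = item_probability_by_group(score, 0, 1.2, 0.0, 0.2, 0.95, b_dif=0.5)
        p_foc = item_probability_by_group(score, 1, 1.2, 0.0, 0.2, 0.95, b_dif=0.5)
        print(score, p_ref, p_foc)
```

Testing for DIF then amounts to asking whether the group-specific shifts differ from zero, for example by comparing the fit of models with and without them.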

Researcher, Institute of Computer Science, Czech Academy of Sciences

Researcher in statistics and psychometrics from Prague. Uses R to boost active learning in classes. Fulbright alumna and 2013-2015 visiting research scholar with Center for Statistics and the Social Sciences and Department of Statistics, University of Washington.

Tuesday June 28, 2016 2:12pm - 2:30pm PDT

Barnes & McDowell & Cranston

FiveThirtyEight's data journalism workflow with R

**Moderators**
## Torben Tvedebrink

I'm a statistician working in the area of forensic genetics. R is a core part of my teaching and research. I was the chair of the local organising committee for useR! 2015 in Aalborg, and part of the programme committee for useR! 2016 in Stanford.

**Speakers**
## Andrew Flowers

FiveThirtyEight is a data journalism site that uses R extensively for charts, stories, and interactives. We’ve used R for stories covering: p-hacking in nutrition science; how Uber is affecting New York City taxis; workers in minimum-wage jobs; the frequency of terrorism in Europe; the pitfalls in political polling; and many, many more.

R is used in every step of the data journalism process: for cleaning and processing data, for exploratory graphing and statistical analysis, for models deployed in real time, and to create publishable data visualizations. We write R code to underpin several of our popular interactives as well, like the Facebook Primary and our historical Elo ratings of NBA and NFL teams. Heck, we’ve even styled a custom ggplot2 theme. We even use R code on long-term investigative projects.

In this presentation, I’ll walk through how cutting-edge, data-oriented newsrooms like FiveThirtyEight use R by profiling a series of already-published stories and projects. I’ll explain our use of R for chart-making in sports and politics stories; for the data analysis behind economics and science feature pieces; and for production-worthy interactives.

Associate Professor, Department of Mathematical Sciences, Aalborg University

Quantitative Editor, FiveThirtyEight

As the quantitative editor of FiveThirtyEight, I write stories about a variety of topics -- economics, politics, sports -- while also doing data science tasks for other staff writers. Before starting at FiveThirtyEight in 2013, I was at the Federal Reserve Bank of Atlanta.

Tuesday June 28, 2016 2:12pm - 2:30pm PDT

McCaw Hall

RosettaHUB-Sheets, a programmable, collaborative web-based spreadsheet for R, Python and Spark

**Moderators**
## Ioannis Kosmidis

I am a Senior Lecturer at the Department of Statistical Science in University College London. My theoretical and methodological research focuses on optimal estimation and inference from complex statistical models, penalized likelihood methods and clustering. A particular focus of... Read More →

**Speakers**

RosettaHUB-Sheets combine the flexibility of the bi-dimensional data representation model of classic spreadsheets with the power of R, Python, Spark and SQL. RosettaHUB-Sheets are web-based and can be created programmatically on any cloud. They enable Google Docs-like real-time collaboration while preserving the user's data privacy. They have no size limitation and can leverage the cloud for performance and scalability. RosettaHUB-Sheets act as highly flexible bi-dimensional notebooks, as they make it possible to create powerful mash-ups of multi-language scripts and results. RosettaHUB-Sheets are combined with an interactive widgets framework with the ability to overlay advanced interactive widgets and visualizations, including 3D ParaView ones. RosettaHUB-Sheets are fully programmable in R, Python and JavaScript; macros similar to Excel VBA's can be triggered by changes in the state of cells and variables. RosettaHUB-Sheets can be shared to allow real-time collaboration, interactive teaching, etc. RosettaHUB-Sheets are represented by an SQL database and can be queried and updated in pure SQL. The RosettaHUB Excel add-in makes it possible to synchronize a local Excel sheet with a RosettaHUB-Sheet on the cloud: Excel becomes capable of accessing any R or Python function as a formula and can interact seamlessly with powerful cloud-based capabilities. Likewise, any Excel VBA function or data can be seamlessly exposed and shared to the web. RosettaHUB-Sheets are the first bi-dimensional data science notebooks; they give access to the most popular data science tools and aim to contribute to the democratization and pervasiveness of data science.

Associate Professor, Department of Statistical Science, University College London

Tuesday June 28, 2016 2:12pm - 2:30pm PDT

Lane & Lyons & Lodato

Teaching R to 200 people in a week

**Moderators**

**Speakers**
## Michael Levy

Across disciplines, scholars are waking up to the potential benefits of computational competence. This has created a surge in demand for computational education which has gone widely underserved. Software Carpentry and similar efforts have worked to fill this gap with short, intensive introductions to computational tools, including R. Such an approach has numerous advantages; however, it is labor intensive, with student:instructor ratios typically below ten, and it is diffuse, introducing three major tools in two days. I recently adapted Software Carpentry strategies and tactics to provide a deeper introduction to R over the course of a week with a student:instructor ratio above 50. Here, I reflect on what worked and what I would change, with the goal of providing other educators with ideas for improving computational education. Aspects of the course that worked well include live coding during lectures, which builds in flexibility, demonstrates the debugging process, and forces a slower pace; multiple channels of feedback combined with flexibility to adapt to student needs and desires; and iterative, progressively more-open-ended exercises to solidify syntactical understanding and relate functions, idioms, and techniques to larger goals. Aspects of the course that I would change and caution other educators about include increasing the frequency and shortening the duration of student exercises, delaying the introduction of non-standard evaluation, and avoiding any prerequisite statistical understanding. These and other suggestions will benefit a variety of R instructors, whether for intensive introductions, traditional computing courses, or as a component of statistics courses.

PhD Candidate, University of California, Davis

Network analysis, environmental social science, R users' groups, teaching R and stats

Tuesday June 28, 2016 2:12pm - 2:30pm PDT

SIEPR 130

Wrap your model in an R package!

**Moderators**

**Speakers**

The groundwater drawdown model WTAQ-2, provided by the United States Geological Survey for free, has been "wrapped" into an R package, which contains functions for writing input files, executing the model engine and reading output files. By calling the functions from the R package, a sensitivity analysis, calibration or validation requiring multiple model runs can be performed in an automated way. Automation by means of programming improves and simplifies the modelling process by ensuring that the WTAQ-2 wrapper generates consistent model input files, runs the model engine and reads the output files without requiring the user to cope with the technical details of the communication with the model engine. In addition, the WTAQ-2 wrapper automatically adapts cross-dependent input parameters correctly when one of them is changed by the user. This ensures the formal correctness of the input file and minimises the effort for the user, who normally has to consider all cross-dependencies for each input file modification manually by consulting the model documentation. Consequently, the focus can be shifted to retrieving and preparing the data needed by the model. Modelling is described in the form of version-controlled R scripts so that its methodology becomes transparent and modifications (e.g. error fixing) trackable. The code can be run repeatedly and will always produce the same results given the same inputs. The implementation in the form of program code further yields the advantage of inherently documenting the methodology. This leads to reproducible results, which should be the basis for smart decision making.
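
The wrapper pattern described above (write a consistent input file, execute the engine, read the output) can be sketched roughly as follows. This is a minimal illustration in Python, not the authors' R package: the file format, the parameter names (`pumping_rate`, `times`, `n_times`) and the stand-in engine `FAKE_ENGINE` are all hypothetical and bear no relation to the real WTAQ-2 input format.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for a compiled model engine. It reads "key=value"
# lines from an input file and writes "time,drawdown" rows to an output file.
FAKE_ENGINE = r"""
import sys
params = dict(line.strip().split("=", 1) for line in open(sys.argv[1]) if line.strip())
rate = float(params["pumping_rate"])
times = [float(t) for t in params["times"].split(",")]
assert int(params["n_times"]) == len(times), "inconsistent input file"
with open(sys.argv[2], "w") as out:
    for t in times:
        out.write("%g,%g\n" % (t, rate * t))
"""


def write_input(path, pumping_rate, times):
    """Write an input file; the dependent parameter n_times is derived."""
    lines = [
        f"pumping_rate={pumping_rate}",
        f"times={','.join(str(t) for t in times)}",
        f"n_times={len(times)}",  # cross-dependent value kept in sync
    ]
    Path(path).write_text("\n".join(lines) + "\n")


def run_engine(input_path, output_path):
    """Run the engine as an external process; raise if it exits with an error."""
    subprocess.run(
        [sys.executable, "-c", FAKE_ENGINE, str(input_path), str(output_path)],
        check=True,
    )


def read_output(path):
    """Parse the engine output into a list of (time, drawdown) tuples."""
    rows = []
    for line in Path(path).read_text().splitlines():
        t, s = line.split(",")
        rows.append((float(t), float(s)))
    return rows


def run_model(pumping_rate, times):
    """One call performs the full cycle: write input, run engine, read output."""
    with tempfile.TemporaryDirectory() as tmp:
        inp, out = Path(tmp) / "model.in", Path(tmp) / "model.out"
        write_input(inp, pumping_rate, times)
        run_engine(inp, out)
        return read_output(out)


# A tiny sensitivity analysis: several automated runs with a varying input.
results = {rate: run_model(rate, [1.0, 2.0, 4.0]) for rate in (0.5, 1.0, 2.0)}
```

Because `n_times` is computed inside `write_input`, a user who changes `times` cannot produce an inconsistent input file, which is the kind of cross-dependency handling the abstract refers to.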

Tuesday June 28, 2016 2:12pm - 2:30pm PDT

SIEPR 120

A Large Scale Regression Model Incorporating Networks using Aster and R
**Poster #26**

Leveraging the Aster platform and the TeradataAsterR package, end users can overcome the memory and scalability limitations of R and the costs of transferring large amounts of data between platforms. We explore the integration of R with Aster, an MPP database from Teradata, focusing on a predictive analytics case study from Wells Fargo. It is crucial for Wells Fargo to understand customers’ behavior and the reasons behind it. In this analysis, we used Aster’s graph analysis functionality to explore customer relationships and examine how network effects change customers’ behavior. A logistic regression model was built, and an R Shiny application was used to visually represent the impact of important attributes from the model.

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

All-inclusive but Practical Multivariate Stochastic Forecasting for Electric Utility Portfolio
**Poster #2**

Electric utility portfolio risk simulation requires stochastically forecasting various time series data: power and gas prices, peak and off-peak loads, thermal, solar and wind generation, and other covariates, at different time granularities. Together, these present modeling issues of autocorrelation, linear and non-linear covariate relationships, non-normal distributions, outliers, seasonal and weekly shapes, heteroskedasticity, temporal disaggregation and dispatch optimization. As a practitioner, I’ll discuss how to organize and put together such a portfolio model, from data scraping and simulation modeling all the way to deployment through a Shiny UI, while pointing out what worked and what didn’t.

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Applied Biclustering Using the BiclustGUI R Package
**Poster #15**

Big and high-dimensional data with complex structures have been emerging steadily and rapidly over the last few years. A relatively new data analysis method that aims to discover meaningful patterns in a big data matrix is *biclustering*. This method applies clustering simultaneously on two dimensions of a data matrix and aims to find a subset of rows for which the response profile is similar across a subset of columns in the data matrix. This results in a submatrix called a bicluster. The package **RcmdrPlugin.BiclustGUI** is a GUI plug-in for R Commander for biclustering. It combines different biclustering packages to provide many algorithms for data analysis, visualisations and diagnostic tools in one unified framework. Because it is built on R Commander, the BiclustGUI produces the original R code in the background while the interface is being used; this is useful for more experienced R users who would like to transition from the interface to actual R code after using the algorithms. Further, the BiclustGUI package contains template scripts that allow future developers to create their own biclustering windows and include them in the package. The BiclustGUI is available on CRAN and on R-Forge. The GUI also has a Shiny implementation including all the main functionalities. Lastly, the template scripts have been generalized in the **REST** package, a new helper tool for creating R Commander plug-ins.

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Bridging the Data Visualization to Digital Humanities gap: Introducing the Interactive Text Mining Suite
**Poster #24**

In recent years, there has been growing interest in data visualization for text analysis. While text mining and visualization tools have been successfully integrated into research methods in many fields, their use still remains infrequent in mainstream Digital Humanities. Many tools require extensive programming skills, which can be a roadblock for some literary scholars. Furthermore, while some visualization tools provide graphical user interfaces, many humanities researchers desire more interactive and user-friendly control of their data. In this talk we introduce the Interactive Text Mining Suite (ITMS), an application designed to facilitate visual exploration of digital collections. ITMS provides a dynamic interface for performing topic modeling, cluster detection, and frequency analysis. With this application, users gain control over model selection, text segmentation as well as graphical representation. Given the considerable variation in literary genres, we have also designed our graphical user interface to reflect choice of studies: scholarly articles, literary genre, and sociolinguistic studies. For documents with metadata we include tools to extract the metadata for further analysis. Development with the Shiny web framework provides a set of clean user interfaces, hopefully freeing researchers from the limitations of memory or platform dependency.

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Community detection in multiplex networks: An application to the C. elegans neural network
**Poster #31**

We explore data from the neuronal network of the nematode C. elegans, a tiny hermaphroditic roundworm. The data consist of 279 neurons and 5863 directed connections between them, represented by three connectomes of electrical and chemical synapses. Our approach uses a fully Bayesian two-stage clustering method, based on Dirichlet processes, that borrows information across the connectomes to identify communities of neurons via stochastic block modeling. This structure allows us to understand the communication patterns between the motor neurons, interneurons, and sensory neurons of the C. elegans nervous system.

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Curde: Analytical curves detection
**Poster #21**

The main aim of our work is to develop the new R package curde. The package is used to detect lines or conic curves in a digital image. The package contains the Hough transformation for line detection using an accumulator. The Hough transform is a feature extraction technique whose purpose is to find imperfect instances of objects within a certain class of shapes. This technique is not suitable for curves with more than three parameters. For conic fitting, robust regression is used. For noisy data, a solution based on Least Median of Squares (LMedS) is highly recommended. The package also implements algorithms for non-user image evaluation. The whole process of non-user image evaluation includes image preparation, which consists of various methods such as image grayscaling, thresholding or histogram estimation. The conversion from the grayscaled image to a binary one is realised by computing the Sobel operator convolution and applying the threshold technique. After that the convolution technique is applied. The new R package curde will integrate all of these techniques into one comprehensive package.
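
To make the accumulator idea concrete, here is a minimal pure-Python sketch of Hough line detection. It is not code from the curde package: the function name `hough_lines`, the one-degree angle grid and the integer rounding of the distance are illustrative assumptions. Each edge point votes for every line through it, written as rho = x·cos(theta) + y·sin(theta), and the most-voted cell identifies the dominant line.

```python
import math
from collections import Counter


def hough_lines(points, theta_step_deg=1):
    """Fill a (theta, rho) accumulator with one vote per point and angle.

    A line is parameterised as rho = x*cos(theta) + y*sin(theta), with
    theta in degrees in [0, 180) and rho rounded to the nearest integer.
    """
    accumulator = Counter()
    for x, y in points:
        for theta_deg in range(0, 180, theta_step_deg):
            theta = math.radians(theta_deg)
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            accumulator[(theta_deg, rho)] += 1
    return accumulator


# Hypothetical edge pixels: a horizontal line y = 5 plus three stray points.
points = [(x, 5) for x in range(100)] + [(3, 40), (70, 22), (15, 80)]
accumulator = hough_lines(points)

# The accumulator cell with the most votes describes the detected line.
(best_theta, best_rho), votes = accumulator.most_common(1)[0]
```

The accumulator has one dimension per curve parameter, so its size grows quickly with the number of parameters; this is one reason the technique becomes impractical for curves with many parameters.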

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

DiLeMMa - Distributed Learning with Markov Chain Monte Carlo Algorithms with the ROAR Package
**Poster #29**

Markov Chain Monte Carlo algorithms are a general technique for learning probability distributions. However, they tend to mix slowly in complex, high-dimensional models, and scale poorly to large datasets. This package arose from the need to conduct high-dimensional inference in large models using R. It provides a distributed version of stochastic gradient-based variants of common continuous Metropolis algorithms, and uses the theory of optimal acceptance rates of Metropolis algorithms to automatically tune the proposal distribution to its optimal value. We describe how to use the package to learn complex distributions, and compare it to other packages such as RStan.
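
The idea of tuning a proposal toward a target acceptance rate can be illustrated with a much simpler algorithm than the one the package provides. The sketch below is a single-process, one-dimensional random-walk Metropolis sampler, not the distributed stochastic gradient method of the abstract. The target rate of 0.44 (a value commonly cited for one-dimensional random-walk proposals), the Robbins-Monro style update and all names are assumptions made for illustration.

```python
import math
import random


def adaptive_metropolis(log_density, n_iter=6000, target_rate=0.44, seed=1):
    """Random-walk Metropolis whose proposal scale adapts to a target rate."""
    rng = random.Random(seed)
    x, log_step = 0.0, 0.0
    samples, accepted_flags = [], []
    for t in range(1, n_iter + 1):
        proposal = x + rng.gauss(0.0, math.exp(log_step))
        log_ratio = log_density(proposal) - log_density(x)
        # Standard Metropolis rule: accept with probability min(1, ratio).
        accepted = rng.random() < math.exp(min(0.0, log_ratio))
        if accepted:
            x = proposal
        # Widen the proposal after an acceptance and narrow it after a
        # rejection, with a gain that decays as the chain runs.
        log_step += (float(accepted) - target_rate) / t ** 0.6
        samples.append(x)
        accepted_flags.append(accepted)
    return samples, accepted_flags, math.exp(log_step)


# Target: a standard normal density, known up to an additive constant.
samples, flags, step = adaptive_metropolis(lambda x: -0.5 * x * x)

# Acceptance rate over the second half of the run, after adaptation settles.
late_rate = sum(flags[3000:]) / len(flags[3000:])
```

After the early iterations the proposal scale settles where the observed acceptance rate matches the target, so no manual tuning of the step size is needed.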

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

High-performance R with FastR
**Poster #12**

R is a highly dynamic language that employs a unique combination of data type immutability, lazy evaluation, argument matching, a large amount of built-in functionality, and interaction with C and Fortran code. While these are straightforward to implement in an interpreter, it is hard to compile R functions to efficient bytecode or machine code. Consequently, applications that spend a lot of time in R code often have performance problems. Common solutions are to try to apply primitives to large amounts of data at once and to convert R code to a native language like C. FastR is a novel approach to solving R’s performance problem. It makes extensive use of the dynamic optimization features provided by the Truffle framework to remove the abstractions that the R language introduces, and can use the Graal compiler to create optimized machine code on the fly. This talk introduces FastR and the basic concepts behind Truffle’s optimization features. It provides examples of the language constructs that are particularly hard to implement using traditional compiler techniques, and shows how to use FastR to improve performance without compromising on language features.

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

IMGTStatClonotype: An R package with integrated web tool for pairwise evaluation and visualization of IMGT clonotype diversity and expression from IMGT/HighV-QUEST output
**Poster #11**

The adaptive immune response is our ability to produce up to 2 × 10¹² different immunoglobulins (IG) or antibodies and T cell receptors (TR) per individual to fight pathogens. IMGT®, the international ImMunoGeneTics information system® (http://www.imgt.org), was created in 1989 by Marie-Paule Lefranc (Montpellier University and CNRS) to manage the huge and complex diversity of these antigen receptors and is at the origin of immunoinformatics, a science at the interface between immunogenetics and bioinformatics. Next generation sequencing (NGS) generates millions of IG and TR nucleotide sequences, and there is a need for standardized analysis and statistical procedures in order to compare immune repertoires. IMGT/HighV-QUEST is the unique web portal for the analysis of IG and TR high throughput sequences. Its standardized statistical outputs include the characterization and comparison of the clonotype diversity in up to one million sequences. IMGT® has recently defined a procedure for evaluating statistical significance of pairwise comparisons between differences in proportions of IMGT clonotype diversity and expression, per gene of a given IG or TR V, D or J group. The procedure is generic and suitable for detecting significant changes in IG and TR immunoprofiles in protective (vaccination, cancers and infections) or pathogenic (autoimmunity and lymphoproliferative disorders) immune responses. In this talk, I will present the new R package (’IMGTStatClonotype’) which incorporates the IMGT/StatClonotype tool developed by IMGT® to perform pairwise comparisons of sets from IMGT/HighV-QUEST output through a user-friendly web interface in users’ own browser.

**Speakers**
## Safa Aouinti

IMGT® - the international ImMunoGeneTics information system®, Institute of Human Genetics

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Imputing Gene Expression to Maximise Platform Compatibility
**Poster #9**

Microarray measurements of gene expression constitute a large fraction of publicly shared biological data, and are available in the Gene Expression Omnibus (GEO). Many studies use GEO data to shape hypotheses and improve statistical power. Within GEO, the Affymetrix HG-U133A and HG-U133 Plus 2.0 are the two most commonly used microarray platforms for human samples; the HG-U133 Plus 2.0 platform contains 54,220 probes and the HG-U133A array contains a proper subset (21,722 probes). When different platforms are involved, the subset of common genes is most easily compared. This approach results in the exclusion of substantial measured data and can limit downstream analysis. To predict the expression values for the genes unique to the HG-U133 Plus 2.0 platform, we constructed a series of gene expression inference models based on genes common to both platforms. Our model predicts gene expression values that are within the variability observed in controlled replicate studies and are highly correlated with measured data. Using six previously published studies, we also demonstrate the improved performance of the enlarged feature space generated by our model in downstream analysis.

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Integrating R & Tableau
**Poster #27**

Tableau is regularly used by our clients for the purposes of visualization and dashboarding, but they also often require the analytics and statistical functionality of R to analyze their data. While Tableau supports the integration of R, it is not always a straightforward process to blend the functionality of the two together. We plan to discuss our lessons learned from building Tableau applications that integrate with R, including best practices for performance optimization, sessionizing interaction on Tableau production servers, and reducing network latency issues. We will also discuss the limitations of Tableau’s R integration capability.

Our goal is to help others avoid common frustrations and roadblocks when integrating R and Tableau.

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Making Shiny Seaworthy: A weighted smoothing model for validating oceanographic data at sea.
**Poster #30**

The City of San Diego conducts one of the largest ocean monitoring programs in the world, covering ~340 square miles of coastal waters and sampling at sea ~150 days each year. Water quality monitoring is a cornerstone of the program and requires the use of sophisticated instrumentation to measure a suite of oceanographic parameters (e.g., temperature, depth, salinity, dissolved oxygen, pH). The various sensors or probes can be episodically temperamental, and oceanographic data can be inherently non-linear, especially within stratifications (i.e., where the water properties change rapidly with small changes in depth). This makes it difficult to distinguish between extreme observations due to natural events (anomalous data) and those due to instrumentation error (erroneous data), thus requiring manual data validation at sea.

This Shiny app improves the manual validation process by providing a smoothing model to flag erroneous data points while including anomalous data. Standard smoothing models were unable to model stratification without including erroneous data, so we elected to use a custom weighted average model where observations with a greater deviation from the local mean have less weight.

We coupled this model with an interactive Shiny session using ggplot2 and R Portable to create an offline web application for use at sea. This Shiny app takes in a raw data file, presents a series of interactive graphs for removing/restoring potentially erroneous data, and exports a new data file. Additional customization of the Shiny interface using the shinyBS package, JavaScript, and HTML improves the user experience.
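
The weighted average idea described above can be sketched as follows. The abstract does not give the actual weight function, so the inverse-squared-deviation weights, the window width, the flagging threshold and all names below are assumptions chosen only to show the principle: observations far from the local mean contribute little to the smoothed value, so a faulty reading stands out against it.

```python
def weighted_smooth(values, half_width=2, scale=1.0):
    """Smooth a profile with a moving weighted average.

    Within each window, a value's weight shrinks as its deviation from the
    window mean grows, so isolated spikes barely affect the smoothed value.
    """
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half_width): i + half_width + 1]
        local_mean = sum(window) / len(window)
        weights = [1.0 / (1.0 + ((v - local_mean) / scale) ** 2) for v in window]
        total = sum(w * v for w, v in zip(weights, window))
        smoothed.append(total / sum(weights))
    return smoothed


def flag_suspect(values, smoothed, threshold):
    """Return indices whose observed value is far from the smoothed value."""
    return [
        i for i, (v, s) in enumerate(zip(values, smoothed))
        if abs(v - s) > threshold
    ]


# Hypothetical temperature profile with one faulty sensor reading at index 5.
temperature = [10.0] * 5 + [50.0] + [10.0] * 5
smoothed = weighted_smooth(temperature)
suspect = flag_suspect(temperature, smoothed, threshold=5.0)
```

In an interactive setting such as the app described here, the flagged indices would only be suggestions, leaving the decision to remove or restore each point to the analyst.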

**Speakers**
## Kevin Wayne Byron

Marine Biologist, City of San Diego

I am interested in developing software and statistical tools for supporting biological research. As a Marine Biologist for the City of San Diego's Ocean Monitoring Program's IT/GIS team, my group is responsible for database management, low-level IT support, GIS, and R coordination...

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Multi-stage Decision Method To Generate Rules For Student Retention
**Poster #13**

The retention of college students is an important problem that can be analyzed with computing techniques, such as data mining, to identify students who may be at risk of dropping out. The importance of the problem has grown as institutions must meet legislative retention mandates, face budget shortfalls due to decreased tuition or state-based revenue, and fall short of producing enough graduates in fields of need, such as computing. While data mining techniques have been applied with some success, this article aims to show how R can be used to develop a hybrid methodology that creates rules for the minority class with a coverage and accuracy range not available in the existing literature. A multiple stage decision methodology (MSDM) uses data mining techniques to extract rules from an institution’s student data set so that administrators can identify at-risk students. The data mining techniques, implemented in R, include partial decision trees, K-means clustering, and Apriori association mining. MSDM was able to identify students with up to 89% accuracy on student datasets in which at-risk students were outnumbered by retained students, an imbalance that made the at-risk model difficult to build. The motivation for using R was twofold: first, to generate rules for the minority class, and second, to make the analysis reproducible.

**Speakers**
## Soma Datta

Assistant Professor, University Of Houston Clear Lake

R in decision trees and Apriori, Controlled decision trees, Teaching R in school.

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

mvarVis: An R package for Visualization of Multivariate Analysis Results
**Poster #28**

mvarVis is an R package for visualizing the results of diverse multivariate analysis methods. We implement two new tools to facilitate analyses that are cumbersome with existing software. The first uses htmlwidgets and d3 to create interactive ordination plots; the second makes it easy to bootstrap multivariate methods and align the resulting scores. The interactive visualizations offer an alternative to printing multiple plots with different supplementary information overlaid, and bootstrapping enables a qualitative assessment of the uncertainty underlying the application of exploratory multivariate methods to particular data sets.

Our approach is to leverage existing packages -- FactoMineR, ade4, and vegan -- to perform the actual dimension reduction, and to build a new layer for visualizing and bootstrapping their results. This allows our tools to wrap a variety of existing methods, including one-table, multitable, and distance-based approaches -- principal components, multiple factor analysis, and multidimensional scaling, for example. Since our package uses htmlwidgets, it is possible to embed our interactive plots in R Markdown pages and Shiny apps. All code and many examples are available on our GitHub.

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Prediction of key parameters in the production of biopharmaceuticals using R
**Poster #4**

In this contribution we present our workflow for model prediction in E. coli fed-batch production processes using R. The major challenges in this context are the fragmentary understanding of bioprocesses and the severely limited real-time access to process variables related to product quality and quantity. Data-driven modeling of process variables, in combination with model predictive process control concepts, represents a potential solution to these problems. In R, the statistical techniques best qualified for bioprocess data analysis and modeling are readily available.

In a benchmark study, the performance of a number of machine learning methods is evaluated, namely random forest, neural networks, partial least squares and structured additive regression models. For that purpose, a series of recombinant E. coli fed-batch production processes with varying cultivation conditions was conducted, employing a comprehensive on- and offline process monitoring platform. The prediction of cell dry mass and recombinant protein based on process parameters available online and two-dimensional multi-wavelength fluorescence spectroscopy is investigated. Parameter optimization and model validation are performed in the framework of a leave-one-fermentation-out cross validation. Computations are performed using, among others, the R packages robfilter, boost, nnet, randomForest, pls and caret. The results clearly argue for a combined approach: neural networks as the modeling technique and random forest as the variable selection tool.
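To make the leave-one-fermentation-out idea concrete, the sketch below holds out every observation from one fermentation run per fold and scores a model fitted on the remaining runs. It is a Python illustration, not the authors' R workflow. The run labels, the response values, and the mean-only baseline model are hypothetical placeholders; a real analysis would fit one of the models named above on the training runs.

```python
from math import sqrt


def leave_one_group_out(groups):
    """Yield (held_out, train_idx, test_idx), holding out one whole group per fold."""
    for held_out in sorted(set(groups)):
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train_idx, test_idx


def cross_validate_baseline(y, groups):
    """Return {held-out group: RMSE} for a baseline that predicts the training mean."""
    scores = {}
    for held_out, train_idx, test_idx in leave_one_group_out(groups):
        # Placeholder model: the mean response of the training fermentations.
        prediction = sum(y[i] for i in train_idx) / len(train_idx)
        squared_errors = [(y[i] - prediction) ** 2 for i in test_idx]
        scores[held_out] = sqrt(sum(squared_errors) / len(squared_errors))
    return scores


# Hypothetical cell dry mass measurements from three fermentation runs.
runs = ["F1", "F1", "F2", "F2", "F3"]
cell_dry_mass = [1.0, 1.0, 2.0, 2.0, 3.0]
print(cross_validate_baseline(cell_dry_mass, runs))
```

Splitting by run rather than by observation keeps correlated measurements from the same fermentation out of the training set, so each score reflects prediction for a previously unseen process.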

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

R Microplots in Tables with the latex() Function
**Poster #16**

Microplots are often used within cells of a tabular array. We describe several simple R functions that simplify the use of microplots within LaTeX documents constructed within R. These functions are coordinated with the latex() function in the Hmisc package or the xtable() function in the xtable package. We show examples using base graphics and three graphics systems based on grid: lattice graphics, ggplot2 graphics, and vcd graphics. These functions work smoothly with standalone LaTeX documents and with Sweave, knitr, Org mode, and R Markdown.

**Speakers**
## Richard Heiberger

Professor Emeritus, Temple University, Department of Statistics, Fox School of Business

Censuses and Surveys of the Jewish People

Rich Heiberger is co-chair of P'nai Or Philadelphia, and was on the Board of the NHC some years ago.

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

R Shiny Application for the Evaluation of Surrogacy in Clinical Trials
**Poster #18**

In clinical trials, the determination of the true endpoint or the effect of a new therapy on the true endpoint may be difficult, requiring an expensive, invasive or uncomfortable procedure. Furthermore, in some trials the primary endpoint of interest (the “true endpoint”), for example death, is rare and/or takes a long period of time to reach. In such trials, there would be benefit in finding a more proximate endpoint (the “surrogate endpoint”) to determine more quickly the effect of an intervention.

We present a new R Shiny application for the evaluation of surrogate endpoints in randomized clinical trials using patient data. The Shiny application for surrogacy consists of a set of user-friendly functions that allow the evaluation of different types of endpoints (i.e., continuous, categorical, binary, and survival endpoints) and produce unified, interoperable output. With this new Shiny app, users do not need to have R installed on their computer: it is a web-based application that can be run from any device with an internet connection.

We demonstrate the usage and capabilities of this Shiny app for surrogacy using several example clinical trials in which validation of a surrogate for the primary endpoint was of interest.

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

RCAP Designer: An RCloud Package to create Analytical Dashboards
**Poster #25**

RCloud is an open source social coding environment for Big Data analytics and visualization developed by AT&T Labs. We discuss RCAP Designer, an RCloud package that provides a way for data scientists to build R web applications, similar to Shiny in the RStudio environment.

RCAP Designer creates a workflow in which the source R code is written within the RCloud environment in an R notebook. The package allows the data scientist to transform this notebook into an R dashboard application. This does not require developing web code (JavaScript, CSS, etc.). A number of widgets have been developed for creating the page design, several kinds of content (R plots, interactive plots, an iframe, etc.) and the different event controls for the page. For example, to include an R plot, one would drag and drop the RPlot widget onto the canvas. After the plot window is sized appropriately, the widget is configured to select the R plot function from the current workspace and automatically link it to the control parameters. Once the design elements are saved, RCAP uses RCloud to render the page on the fly.

High-level RCAP design considerations: On the server (R) side, RCAP wraps the user’s R code with the templates needed to push the results back to the client side. This includes all of the RCloud commands and various error-catching mechanisms. These wrapped functions are exposed to the JavaScript via OCAP. The user can write ordinary plotting code and RCAP makes sure it appears on the page. The JavaScript supplied by the widgets is in charge of the layout. It lays out the grid and loads the text, iframes and any other static content. The event controller widgets in RCAP use the reactive programming paradigm.

RCAP is a statistician’s convenient web publishing tool for R analytics and visualizations developed within the RCloud environment.

References:

Subramaniam, G., Larchuk, T., Urbanek, S. and Archibad, R. (2014). iwplot: An R Package for Creating web Based Interactive. In useR! 2014, The R User Conference (UCLA, USA), Jul. 2014.

Woodhull, G. (2014). RCloud – Integrating Exploratory Visualization, Analysis and Deployment. In useR! 2014, The R User Conference (UCLA, USA), Jul. 2014.

R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL: http://www.R-project.org.

RStudio, Inc, shiny: Easy web applications in R, 2014, URL: http://shiny.rstudio.com

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Sequence Analysis with Package TraMineR
**Poster #6**

Sequence analysis started in the biological sciences as a way to examine patterns in protein and DNA sequences, and was subsequently applied in the social sciences to study patterns in sequences drawn from individuals’ life courses. Many social science studies concerned with time series record their data as sequences. Past studies using sequence analysis have examined dance steps, class careers, employment biographies, family histories, school-to-work transitions, occupational career patterns, and other life-course trajectories.

TraMineR is a package specially designed for carrying out sequence analysis in the social sciences (Gabadinho, Studer, Muller, Buergin, & Ritschard, 2015). It is a data mining tool well suited to mining and grouping social sequence data. It contains a toolbox for the manipulation, description and rendering of sequences, along with functions that produce graphical output describing state sequences, categorical sequences, and sequence complexity. It also offers functions for computing distances between sequences with different metrics, including optimal matching, longest common prefix and longest common subsequence. In combination with cluster analysis and multidimensional scaling, these distances can be used to group similar sequences into a typology that helps explain life-course trajectories.

I will briefly outline the key functionalities of TraMineR and demonstrate the procedure for carrying out social sequence analysis with real life examples to highlight the usefulness of the TraMineR package. Other R packages related to sequence analysis will also be covered during the session.
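Two of the distance measures named in this abstract, longest common prefix (LCP) and longest common subsequence (LCS), can be sketched in a few lines. The example is in Python for illustration and does not use or reproduce TraMineR's own functions. It assumes one common definition of the distance, the combined length of the two sequences minus twice the length of their common part, and the state codes (S = school, E = employed, U = unemployed) are hypothetical.

```python
def lcp_length(a, b):
    """Length of the longest common prefix of two state sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sequence_distance(a, b, common_length):
    """Assumed definition: |a| + |b| - 2 * (length of the common part)."""
    return len(a) + len(b) - 2 * common_length(a, b)


# Two hypothetical five-period trajectories.
person_1 = list("SSEEU")
person_2 = list("SEEUU")
print(sequence_distance(person_1, person_2, lcp_length))
print(sequence_distance(person_1, person_2, lcs_length))
```

The LCP distance only rewards agreement at the start of the trajectories, while the LCS distance also credits states that occur in the same order later on, so the two measures can rank the same pair of sequences very differently.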

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

shinyGEO: a web application for analyzing Gene Expression Omnibus (GEO) datasets using shiny
**Poster #7**

Identifying associations between patient gene expression profiles and clinical data provides insight into the biological processes associated with health and disease. The Gene Expression Omnibus (GEO) is a public repository of gene expression and sequence-based datasets, and currently includes >42,000 datasets with gene expression profiles obtained by microarray. Although GEO has its own analysis tool (GEO2R) for identifying differentially expressed genes, the tool is not designed for advanced data analysis and does not generate publication-ready graphics. In this work, we describe a web-based, easy-to-use tool for biomarker analysis in GEO datasets, called shinyGEO.

shinyGEO is a web-based tool that provides a graphical user interface for users without R programming experience to quickly analyze GEO datasets. The tool is developed using 'shiny', a web application framework for R. Specifically, shinyGEO allows a user to download the expression and clinical data from a GEO dataset, to clean the dataset by correcting spelling errors and misaligned data frame columns, to select a gene of interest, and to perform a survival or differential expression analysis using the available data. The tool uses the Bioconductor package 'GEOquery' to retrieve the GEO dataset, while survival and differential expression analyses are carried out using the 'survival' and 'stats' packages, respectively. For both analyses, shinyGEO produces publication-ready graphics using 'ggplot2' and generates the corresponding R code to ensure that all analyses are reproducible. We demonstrate the capabilities of the tool by using shinyGEO to identify diagnostic and prognostic biomarkers in cancer.

**Speakers**
## Jasmine Dumas

Graduate Student & Data Scientist

Formerly a Graduate MS Predictive Analytics student at DePaul University; currently a Graduate MS student at Johns Hopkins Engineering for Professionals studying Computer Science and Data Science. Currently an Associate Data Scientist at The Hartford Insurance Group working on building...

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Statistics and R for Analysis of Elimination Tournaments
**Poster #8**

There is keen interest in statistical methodology in sports. Such methods are valuable not only to sports sociologists but also to those in sports themselves, as exemplified in the book and movie “Moneyball.” These statistics enhance comparisons among players and possibly even enable prediction of games. However, elimination tournaments present special statistical challenges. This paper explores data from the national high school debate circuit, in which the first author was an active national participant. All debaters participate in the 6 pre-elimination rounds, but the field then narrows successively through the elimination rounds. This atypical format makes it difficult to use classical statistical methods, and also requires more sophisticated data wrangling. This paper will use R to explore questions such as: Does gender affect the outcome of rounds? Does geography play a role in wins/losses? What constitutes an upset? Is there a so-called “shadow effect,” in which the weaker the expected competitor in the next round, the greater the probability that the stronger player will win in the current stage? Among the purposes of this project are to serve as an R-based teaching tool and to help the debate community understand the inequalities that exist in relation to gender, region, and school. Typical graphs that can be generated may be viewed at https://github.com/ariel-shin/tourn. Our R software will be available in a package “tourn.”

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Teaching statistics to medical students with R and OpenCPU
**Poster #1**

In general, medical students neither have nor aim at a deeper understanding of statistics. Nevertheless, some knowledge of basic statistical reasoning and methodology is indispensable for grasping the meaning of results from scientific studies published in medical journals. Some familiarity with the correct interpretation of probability statements concerning medical tests is also crucial for physicians.

To supplement our regular statistics classes at the medical faculty, we started to develop an online system providing a pool of assignments. Each student gets an individual assignment with a modified data set, which therefore calls for a slightly different solution. This enables the system to verify each student’s personal achievement, and a database can keep a record of his or her performance.

Our system utilizes OpenCPU installed on a Linux server. The front-end is developed with HTML and JavaScript, while the back-end involves R and MySQL.

The state of development, the problems encountered, and the students’ response will be presented.

**Speakers**
## Joern Pons-Kuehnemann

Institute for Medical Informatics, Justus Liebig University, Giessen, Germany

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Urban Mobility Modeling using R and Big Data from Mobile Phones
**Poster #22**

There has been rapid urbanization as more and more people migrate into cities. The World Health Organization (WHO) estimates that by 2017, a majority of people will be living in urban areas. By 2030, 5 billion people (60 percent of the world’s population) will live in cities, compared with 3.6 billion in 2013. Developing nations must cope with this rapid urbanization while developed ones wrestle with aging infrastructures and stretched budgets. Transportation and urban planners must estimate travel demand for transportation facilities and use this to plan transportation infrastructure. Presently, the techniques used for transportation planning include the conventional four-step transportation planning model, which makes use of data inputs from local and national household travel surveys. However, such surveys are expensive to conduct and cover only small areas of cities, and the time between surveys ranges from 5 to 10 years even in some of the most developed cities. This calls for new and innovative approaches to transportation planning that use new data sources.

In recent years, we have witnessed the proliferation of ubiquitous mobile computing devices (with built-in sensors, GPS, and Bluetooth) that capture the movement of vehicles and people in near real time and generate massive amounts of new data. This study utilizes Call Detail Records (CDR) data from mobile phones and the R programming language to infer travel/mobility patterns. These CDR data contain the locations, times, and dates of billions of phone calls and Short Message Service (SMS) messages sent or received by millions of anonymized users in Cape Town, South Africa. By analyzing relational dependencies of activity time, duration, and land use, we demonstrate that these new “big” data sources are cheaper alternatives for activity-based modeling and travel behavior studies.

**Speakers**
## Daniel Emaasit

Graduate Research Assistant, University of Nevada Las Vegas

Broadly, my research interests involve the development of probabilistic machine learning methods for high-dimensional data, with applications to Urban Mobility, Transport Planning, Highway Safety, & Traffic Operations.

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Using R in the evaluation of psychological tests
**Poster #17**

Psychological tests are used in many fields, including medicine and education, to assess the cognitive abilities of test takers. According to international standards for psychological testing, psychological tests are required to be reliable, fair, and valid. This presentation illustrates how R can be used to assess the reliability, fairness, and validity of psychological tests using the Tower of London task as an example. In clinical neuropsychology, the Tower of London task is widely used to assess a person’s planning ability. Our data consist of 798 respondents who worked on the 24 test items of the Tower of London – Freiburg Version. By employing the framework of factor analysis and item response theory, it is demonstrated that the number of correctly solved problems in this test can be considered as a reliable and sound indicator for the planning ability of the test takers. It is further demonstrated that the individual problem difficulties remain stable across different levels of age, sex and education, which provides evidence for the test’s fairness. All computations were carried out with the R packages psych, lavaan and eRm, all of which are freely available on CRAN.

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Using R with Taiwan Government Open Data to create a tool for monitor the city's age-friendliness
**Poster #20**

With a rapidly growing aging population, creating an age-friendly city is an important goal of modern government. Indexes that reflect a city's age-friendliness can help local government monitor and improve policy practice, and this information should also be open and interactive for citizens who care about the issue. The R language provides great flexibility in dealing with the diverse file formats used by government. In addition, the data visualization and web application tools supported by R can make analysis results more understandable and interactive.

According to the WHO 2015 age-friendly city guidelines, there are eight aspects of comfortable living for older people (outdoor spaces, transportation, housing, social participation, social respect, civic participation, communication, and health and community support). We use Taiwan government open data to integrate indexes with normalization and to visualize the indexes geographically. Finally, we create a Shiny application with interactive Plotly charts to make the results easily accessible. The result may show how easily R can make use of government data and provide an application that turns the WHO guidelines into a monitoring tool that helps government put age-friendly policy into practice.

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Video Tutorials in Introductory Statistics Instruction
**Poster #14**

I use the Rcmdr package in the introductory statistics course I teach for non-majors. For the past several years I've used video tutorials, in addition to written documents covering the same material, for the lab portion of the course where students use Rcmdr and R to analyze data. All course materials are made available via a content management system that allows me to analyze to what degree students are utilizing various delivery mechanisms. This poster will present how I've assembled the video tutorials as well as usage patterns over the last three course offerings. The associations between tutorial usage type/frequency and student performance in the course are also explored.

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Visualization of health and population indicators within urban African populations using R
**Poster #23**

The Demographic and Health Surveys (DHS) Program has collected and disseminated open data on population and health through more than 300 surveys from various countries. One of our research interests is to investigate the linkage between urban poverty and health in African countries. Using the DHS raw data we have computed indicators focusing on exploring how the indicators differ between different groups in the urban areas. These groups are based on wealth tertiles and consist of the urban poor, urban middle and the urban rich.

Following the analysis we have developed the Urban Population and Health Data Visualization Platform, which is an interactive web application built with Shiny. Online deployment of the platform through the APHRC website is underway, and we believe it will assist policymakers and researchers to perform data explorations and gather actionable insights. By sharing the code through GitHub we hope that it will contribute towards promoting the adoption of R, particularly by universities and researchers in Africa, as an alternative to costly proprietary statistical software.

The platform showcases the power of R and is developed using R and various R packages, including shiny, ggplot2, googleVis, rCharts and DT for graphics and dplyr for data manipulation.

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Visualizations and Machine Learning in R with Tessera and Shiny
**Poster #19**

In a divide and recombine (D&R) paradigm, the Tessera tool suite of packages (https://tessera.io), developed at Pacific Northwest National Laboratory, presents a method for dynamic and flexible exploratory data analysis and visualization. At the front end of Tessera, analysts program in the R programming language, while the back end utilizes a distributed parallel computational environment. Using these tools, we have created an interactive display where users can explore visualizations and statistics on a large dataset from the National Football League (NFL). These visualizations allow any user to interact with the data in meaningful ways, leading to an in-depth analysis of the data through general summary statistics as well as insights on fine-grained information. In addition, we have incorporated an unsupervised machine learning scheme utilizing an interactive R Shiny application that predicts positional rankings for NFL players. We have showcased these tools using a variety of available data from the NFL in order to make the displays easily interpretable to a wide audience. Our results, fused into an interactive display, illustrate Tessera’s efficient exploratory data analysis capabilities and provide examples of the straightforward programming interface.

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Writing a dplyr backend to support out-of-memory data for Microsoft R Server
**Poster #10**

Over the last two years, the dplyr package has become very popular in the R community for the way it streamlines and simplifies many common data manipulation tasks. A feature of dplyr is that it’s extensible; by defining new methods, one can make it work with data sources other than those it supports natively. The dplyrXdf package is a backend that extends dplyr functionality to Microsoft R Server’s xdf files, which are a way of overcoming R’s in-memory limitations. dplyrXdf supports all the major dplyr verbs and pipeline notation, and provides some additional features to make working with xdfs easier. In this talk, I’ll share my experiences writing a new back-end for dplyr, and demonstrate how to use dplyr and dplyrXdf to carry out data wrangling tasks on large datasets that exceed the available memory.

Tuesday June 28, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Literate Programming

**Moderators**
## Susan Holmes

**Speakers**

The speaker will discuss what he considers to be the most important outcome of his work developing TeX in the 1980s, namely the accidental discovery of a new approach to programming, which caused a radical change in his own coding style. Ever since then, he has aimed to write programs for human beings (not computers) to read. The result is that the programs have fewer mistakes, they are easier to modify and maintain, and they can indeed be understood by human beings. This facilitates reproducible research, among other things.

Professor, Statistics, Stanford

I like teaching nonparametric multivariate analyses to biologists.
Reproducible research is really important to me and I make all my work available online, mostly as Rmd files. I still like to code, use GitHub and shiny as well as Bioconductor. I am trying to finish a book for biologists...

Tuesday June 28, 2016 3:30pm - 4:30pm PDT

McCaw Hall

Covr: Bringing Code Coverage to R

Code coverage records whether or not each line of code in a package is executed by the package's tests. While it does not check whether a given program or test executes properly, it does reveal areas of the code which are untested. Coverage has a long history in the computer science community (Miller and Maloney in Communications of the ACM, 1963); unfortunately, the R language has lacked a comprehensive and easy-to-use code coverage tool. The covr package was written to make it simple to measure and report test coverage for R, C, C++ and Fortran code in R packages. It has measurably improved testing for numerous packages and also serves as an informative indicator of package reliability. Covr is now used routinely by over 1000 packages on CRAN, Bioconductor and GitHub. I will discuss how covr works, how it is best used and how it has demonstrably improved test coverage in R packages since its release.
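The general idea of line coverage can be illustrated with a small sketch. It is written in Python and uses the interpreter's tracing hook; it is not a description of how covr itself instruments R code, and the helper name `line_coverage` and the toy function `absolute` are assumptions made for the example.

```python
import sys


def line_coverage(func, *args, **kwargs):
    """Call func and return the executed line offsets, counted from its def line."""
    code = func.__code__
    executed = set()

    def tracer(frame, event, arg):
        # Only trace frames that belong to the function under test.
        if frame.f_code is code:
            if event == "line":
                executed.add(frame.f_lineno - code.co_firstlineno)
            return tracer
        return None

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args, **kwargs)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return executed


def absolute(x):
    if x < 0:
        return -x
    return x


# A "test suite" that only ever passes a positive number.
covered = line_coverage(absolute, 5)
# Offsets 1..3 are the three body lines of absolute(); whatever is left was never run.
missed = {1, 2, 3} - covered
```

Here `missed` contains the offset of `return -x`: the test ran without error, yet the negative branch was never exercised, which is exactly the kind of gap a coverage report makes visible.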

Tuesday June 28, 2016 4:45pm - 5:03pm PDT

Econ 140

flexdashboard: Easy interactive dashboards for R

Recently, dashboards have become a common means of communicating the results of data analysis, especially of real-time data, and with good reason: dashboards present information attractively, use space efficiently, and offer eye-catching visualizations that make it easy to consume information at a glance. Traditionally, however, dashboards have been difficult to construct using tools readily available to R users, and so are built by a separate engineering team if they're built at all. In this talk, we present a new package, **flexdashboard**, which empowers R users to build fully-functioning dashboards. To make this possible, **flexdashboard** leverages two existing packages: **R Markdown** and **Shiny**. **R Markdown** provides a means to describe the dashboard's content and layout using simple text constructs; and, optionally, **Shiny** enables interactivity among components and allows the full analytic power of R to be used at runtime. The talk will focus on the practical steps involved in setting up a dashboard using **flexdashboard**, including: building space-filling layouts using declarative R Markdown directives; using dashboard components, such as tables, value boxes, charts, and more; constructing multi-page dashboards for the presentation of larger or more detailed results; and adding interactivity using Shiny.

Tuesday June 28, 2016 4:45pm - 5:03pm PDT

Lane & Lyons & Lodato

How to keep your R code simple while tackling big datasets

**Moderators**
## Gabriela de Queiroz

**Speakers**
## Chuck Piercey

Like many statistical analytic tools, R can be incredibly memory intensive. A simple GAM (generalized additive model) or K-nearest neighbor routine can devour many times the memory occupied by the starting dataset. And R doesn't always behave nicely when it runs out of memory.

There are techniques to get around memory limitations, like using partitioning tools or sampling down. But these require extra work. It would be really nice to run elegantly simple R analytics without that hassle.

Using a really big public dataset from CMS.gov, Chuck will show GAM, GLM, Decision Trees, Random Forest and K-Nearest Neighbor routines that were prototyped and run on a laptop, then run unchanged against the entire dataset on a single simple Linux instance with over a terabyte of RAM. This big computer is actually a collection of smaller off-the-shelf servers using TidalScale to create a single, virtual server with several terabytes of RAM.

Sr. Developer Advocate/Manager, IBM

Gabriela de Queiroz is a Sr. Engineering & Data Science Manager and a Sr. Developer Advocate at IBM where she leads the CODAIT Machine Learning Team. She works in different open source projects and is actively involved with several organizations to foster an inclusive community. She...

KumoScale Product Management, Kioxia

B2B software product management & marketing. Writer: https://medium.com/@chuck1.piercey

Tuesday June 28, 2016 4:45pm - 5:03pm PDT

Barnes & McDowell & Cranston

mlrMBO: A Toolbox for Model-Based Optimization of Expensive Black-Box Functions

**Moderators**

**Speakers**
## Jakob Richter

Many practical optimization tasks, such as finding the best parameters for simulators in engineering or hyperparameter optimization in machine learning, are of a black-box nature, i.e., neither formulas of the objective nor derivative information is available. Instead, we can only query the box for its objective value at a given point. If such a query is very time-consuming, the optimization task becomes extremely challenging, as we have to operate under a severely constrained budget of function evaluations. A modern approach is sequential model-based optimization, also known as Bayesian optimization. Here, a surrogate regression model learns the relationship between decision variables and objective outcome. Sequential point evaluations are planned to simultaneously exploit the functional landscape learnt so far and to ensure exploration of the search space. A popular instance of this general principle is the EGO algorithm, which uses Gaussian processes coupled with the expected improvement criterion for point proposal. The mlrMBO package offers a rich interface to many variants of model-based optimization. As it builds upon the mlr package for machine learning in R, arbitrary surrogate regression models can be applied. It offers a wide variety of options to tackle different black-box scenarios:

- Optimization of pure continuous as well as mixed continuous-categorical search spaces.
- Single-criterion optimization or approximated Pareto fronts for multi-criteria problems.
- Single point proposal or parallel batch point planning during optimization.

The package is designed as a convenient, easy-to-use toolbox of popular state-of-the-art algorithms, but can also be used as a research framework for algorithm designers.
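The expected improvement criterion mentioned above has a well-known closed form when the surrogate's prediction at a point is Gaussian. The following Python sketch shows that formula for a minimization problem; it is an illustration only and does not reflect mlrMBO's interface. The names `expected_improvement` and `propose_next`, and the `predict` callback returning a mean and standard deviation, are assumptions made for the example.

```python
import math


def expected_improvement(mu, sigma, best):
    """E[max(best - Y, 0)] for Y ~ N(mu, sigma^2), i.e. EI when minimizing.

    mu, sigma: surrogate mean and standard deviation at a candidate point.
    best: smallest objective value observed so far.
    """
    if sigma <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best - mu) * cdf + sigma * pdf


def propose_next(candidates, predict, best):
    """Return the candidate with the largest EI; predict(x) gives (mean, sd)."""
    return max(candidates, key=lambda x: expected_improvement(*predict(x), best))
```

The first term rewards points whose predicted mean is already below the best value (exploitation), while the second grows with the predictive standard deviation (exploration), which is how the criterion balances the two goals described in the abstract.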

TU Dortmund University

- Interested in Machine Learning and Hyperparameter Tuning/Optimization.
- Especially through Model Based Optimization using Kriging or Random Forest.
- Parallel Computation

Tuesday June 28, 2016 4:45pm - 5:03pm PDT

SIEPR 130

R at Microsoft

**Moderators**

**Speakers**
## David Smith

Since the acquisition of Revolution Analytics in April 2015, Microsoft has embarked upon a project to build R technology into many Microsoft products, so that developers and data scientists can use the R language and R packages to analyze data in their data centers and in cloud environments.

In this talk I will give an overview (and a demo or two) of how R has been integrated into various Microsoft products. Microsoft data scientists are also big users of R, and I’ll describe a couple of examples of R being used to analyze operational data at Microsoft. I’ll also share some of my experiences in working with open source projects at Microsoft, and my thoughts on how Microsoft works with open source communities including the R Project.

Cloud Advocate, Microsoft

Ask me about R at Microsoft, the R Consortium, or the Revolutions blog.

Tuesday June 28, 2016 4:45pm - 5:03pm PDT

McCaw Hall

Simulation of Synthetic Complex Data: The R-Package simPop

**Moderators**
## Thomas Petzoldt

**Speakers**

The production of synthetic datasets has been proposed as a statistical disclosure control solution to generate public use files out of protected data, and as a tool to create "augmented datasets" to serve as input for micro-simulation models. The performance and acceptability of such a tool relies heavily on the quality of the synthetic populations, i.e., on the statistical similarity between the synthetic and the true population of interest. Multiple approaches and tools have been developed to generate synthetic data. These approaches can be categorized into three main groups: synthetic reconstruction, combinatorial optimization, and model-based generation. We introduce simPop, an open source data synthesizer. simPop is a user-friendly R package based on a modular object-oriented concept. It provides a highly optimized S4 class implementation of various methods, including calibration by iterative proportional fitting and simulated annealing, and modeling or data fusion by logistic regression and other methods.
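Calibration by iterative proportional fitting, one of the methods named above, can be illustrated for the simple two-way case. The Python sketch below is not simPop's implementation; the function name `ipf`, the fixed iteration count, and the toy table are assumptions made for the example.

```python
def ipf(table, row_targets, col_targets, iterations=50):
    """Rescale a two-way table until its margins match the target totals.

    table: list of rows holding non-negative starting weights.
    row_targets, col_targets: desired margins (they must share the same grand total).
    """
    cells = [[float(v) for v in row] for row in table]
    for _ in range(iterations):
        # Step 1: scale every row to its target total.
        for i, target in enumerate(row_targets):
            total = sum(cells[i])
            if total > 0:
                cells[i] = [v * target / total for v in cells[i]]
        # Step 2: scale every column to its target total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in cells)
            if total > 0:
                for row in cells:
                    row[j] *= target / total
    return cells


# Uniform starting weights calibrated to known population margins.
fitted = ipf([[1, 1], [1, 1]], row_targets=[30, 70], col_targets=[40, 60])
```

For this toy input the procedure settles on the table with rows (12, 18) and (28, 42), whose row and column sums reproduce the requested margins.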

Senior Scientist, TU Dresden (Dresden University of Technology)

dynamic modelling, ecology, environmental statistics, aquatic ecosystems, antibiotic resistances, R packages: simecol, deSolve, FME, marelac, growthrates, shiny apps for teaching, object orientation

Tuesday June 28, 2016 4:45pm - 5:03pm PDT

SIEPR 120

'AF' a new package for estimating the attributable fraction

**Moderators**

**Speakers**
## Elisabeth Dahlqwist

The attributable fraction (or attributable risk) is a widely used measure that quantifies the public health impact of an exposure on an outcome. Even though the theory for AF estimation is well developed, there has been a lack of up-to-date software implementations. The aim of this article is to present a new R package for AF estimation with binary exposures. The package AF allows for confounder-adjusted estimation of the AF for the three major study designs: cross-sectional, (possibly matched) case-control and cohort. The article is divided into theoretical sections and applied sections. In the theoretical sections we describe how the confounder-adjusted AF is estimated for each specific study design. These sections serve as a brief but self-consistent tutorial in AF estimation. In the applied sections we use real data examples to illustrate how the AF package is used. All datasets in these examples are publicly available and included in the AF package, so readers can easily replicate all analyses.
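As a point of reference, the crude (unadjusted) attributable fraction for a binary exposure X and binary outcome Y can be written as AF = 1 - P(Y=1 | X=0) / P(Y=1). The Python sketch below computes it from a 2x2 table of counts, assuming cross-sectional or cohort sampling and no confounding. It does not show the confounder-adjusted estimators that the AF package provides, and the function name and arguments are assumptions made for the example.

```python
def attributable_fraction(exposed_cases, exposed_total,
                          unexposed_cases, unexposed_total):
    """Crude AF = 1 - P(Y=1 | X=0) / P(Y=1), computed from 2x2 counts."""
    p_outcome = (exposed_cases + unexposed_cases) / (exposed_total + unexposed_total)
    p_outcome_unexposed = unexposed_cases / unexposed_total
    return 1.0 - p_outcome_unexposed / p_outcome


# 30 of 100 exposed and 10 of 100 unexposed subjects develop the outcome.
af = attributable_fraction(30, 100, 10, 100)
```

In this toy table the overall risk is 0.2 and the risk among the unexposed is 0.1, so `af` is 0.5: under the no-confounding assumption, half of the cases would not have occurred had nobody been exposed.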

Karolinska Institute

I am a PhD student at Karolinska Institutet, Department of Medical Epidemiology and Biostatistics. The topic of my PhD is methodological developments of the attributable fraction. My first project has been to implement an R package which unifies methods for estimating the model-based...

Tuesday June 28, 2016 5:03pm - 5:21pm PDT

McCaw Hall

Crowd sourced benchmarks

**Moderators**

**Speakers**

One of the simplest ways to speed up your code is to buy a faster computer. While this advice is certainly trite, it is something that should still be considered. However, it is often difficult to determine the benefit of upgrading your system. The `benchmarkme` package aims to tackle this question by allowing users to benchmark their system and compare their results with other users. This talk will discuss the results of this benchmarking exercise. Additionally, we'll provide practical advice about how to move your system up the benchmark rankings through byte compiling and using alternate BLAS libraries.

Tuesday June 28, 2016 5:03pm - 5:21pm PDT

Econ 140

Deep Learning for R with MXNet

MXNet is a multi-language machine learning library that eases the development of ML algorithms, especially for deep neural networks. Embedded in the host language, it blends declarative symbolic expression with imperative tensor computation. It offers automatic differentiation to derive gradients. MXNet is computation- and memory-efficient and runs on various heterogeneous systems. The MXNet R package brings flexible and efficient GPU computing and state-of-the-art deep learning to R. It enables users to write seamless tensor/matrix computation with multiple GPUs in R. It also enables users to construct and customize state-of-the-art deep learning models in R, and apply them to tasks such as image classification and data science challenges. Due to the portable design, the MXNet R package can be installed and used on all operating systems supporting R, including Linux, Mac and Windows. In this talk I will provide an overview of the MXNet platform. Demos of state-of-the-art deep learning models will show how users can easily build and modify deep neural networks according to their own needs. At the same time, the GPU backend will ensure the efficiency of all computing work.

Tuesday June 28, 2016 5:03pm - 5:21pm PDT

SIEPR 130

Inside the Rent Zestimates

**Moderators**
## Gabriela de Queiroz

**Speakers**
## Yeng Bun

Zillow, the leading real estate and rental marketplace in the USA, uses R to estimate home values (Zestimates) and rental prices (Rent Zestimates). Every day, we refresh Zestimates and Rent Zestimates on Zillow.com for more than 100 million homes with output from scoring models that are re-trained daily. The model training and scoring infrastructure rests on top of R, allowing rapid prototyping and deployment to production servers. We make extensive use of R, including the development of an in-house package called ZPL that functions similarly to MapReduce on Hadoop, but runs on relational databases. In this presentation, we will go under the hood to see parts of the engine that powers the Rent Zestimates.

Sr. Developer Advocate/Manager, IBM

Gabriela de Queiroz is a Sr. Engineering & Data Science Manager and a Sr. Developer Advocate at IBM where she leads the CODAIT Machine Learning Team. She works in different open source projects and is actively involved with several organizations to foster an inclusive community. She...

Principal Data Scientist, Zillow

Analyzing, modeling and programming with real-estate data as well as production software in R.

Tuesday June 28, 2016 5:03pm - 5:21pm PDT

Barnes & McDowell & Cranston

New Paradigms In Shiny App Development: Designer + Data Scientist Pairing

With the help of Shiny, advanced analytics practitioners have been liberated from professional application development constraints: long-turn development cycles, difficult interactions with IT groups unfamiliar with statistical modelling, challenges in making their content more accessible to broad audiences, and steep resource/time costs. However, this has pushed the burden of UX design, graphical presentation and scaling decisions onto the shoulders of the data scientist, who may or may not have a good background in these fields.

Now that supporting capabilities exist, such as packages that make interfacing with JavaScript visualization libraries easier (htmlwidgets) and the recent release of new Shiny features, the work effort can be split, and much more compelling products can be produced. We plan to discuss a real-life example of creating a Shiny application with HTML templates, modules, etc. with support from a web-design expert. We’ll describe the process of how we worked together to build basic prototypes, the benefits of shared work, and the challenges involved with such diverse skill sets. Finally, we’ll show an example application built for a pricing and promotions model, and describe the impact this toolset had for us and our clients.

Tuesday June 28, 2016 5:03pm - 5:21pm PDT

Lane & Lyons & Lodato

When will this machine fail?

**Moderators**
## Thomas Petzoldt

In this talk, we demonstrate how to develop and deploy end-to-end machine learning solutions for predictive maintenance in the manufacturing industry with R. For predictive maintenance, the following questions about when a machine fails are typically asked: What is the Remaining Useful Life (RUL) of an asset? Will an asset fail within a given time frame? In which time window is an asset likely to fail? We formulate these questions as regression, binary classification and multiclass classification problems respectively, and use a public aircraft engine data set to demonstrate the complete modeling steps in R: data labeling, processing, feature engineering, model training and evaluation. R users are often challenged with productizing the models they have built. After model development, we will show two ways of productization: 1) deploying with SQL Server as stored procedures using the new R services; 2) deploying by publishing as a RESTful web service API. Either approach enables users to call the deployed scoring engine from any application. The talk will include a live demo.
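The data labeling step behind the three problem formulations can be sketched in base R. This is a minimal illustration under stated assumptions: the toy `engines` data frame, the column names, and the window sizes `w0` and `w1` are all hypothetical, and the last observed cycle of each engine is assumed to be its failure point.

```r
# Toy run-to-failure data: one row per engine per operating cycle.
# Assumption: the last observed cycle of each engine is its failure point.
engines <- data.frame(
  id    = c(1, 1, 1, 1, 2, 2, 2),
  cycle = c(1, 2, 3, 4, 1, 2, 3)
)

# Regression target: remaining useful life (RUL) in cycles.
engines$rul <- ave(engines$cycle, engines$id, FUN = max) - engines$cycle

# Hypothetical window sizes (in cycles), chosen only for this toy data.
w1 <- 2
w0 <- 1

# Binary target: will the engine fail within the next w1 cycles?
engines$fail_within_w1 <- as.integer(engines$rul <= w1)

# Multiclass target: 2 = fails within w0 cycles,
# 1 = fails within (w0, w1] cycles, 0 = fails later.
engines$fail_window <- ifelse(engines$rul <= w0, 2L,
                              ifelse(engines$rul <= w1, 1L, 0L))

print(engines)
```

Each label column can then serve as the response for the corresponding regression, binary or multiclass model.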

Senior Scientist, TU Dresden (Dresden University of Technology)

dynamic modelling, ecology, environmental statistics, aquatic ecosystems, antibiotic resistances, R packages: simecol, deSolve, FME, marelac, growthrates, shiny apps for teaching, object orientation

Tuesday June 28, 2016 5:03pm - 5:21pm PDT

SIEPR 120

broom: Converting statistical models to tidy data frames

The concept of "tidy data" offers a powerful and intuitive framework for structuring data to ease manipulation, modeling and visualization, and has guided the development of R tools such as ggplot2, dplyr, and tidyr. However, most functions for statistical modeling, both built-in and in third-party packages, produce output that is not tidy, and that is therefore difficult to reshape, recombine, and otherwise manipulate. I introduce the package "broom," which turns the output of model objects into tidy data frames that are suited to further analysis and visualization with input-tidy tools. The package defines the `tidy`, `augment`, and `glance` methods, which arrange a model into three levels of tidy output respectively: the component level, the observation level, and the model level. These three levels can be used to describe many kinds of statistical models, and offer a framework for combining and reshaping analyses using standardized methods. Along with the implementations in the broom package, this offers a grammar for describing the output of statistical models that can be applied across many statistical programming environments, including databases and distributed applications.
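The three levels of output can be seen on a small linear model. This is a minimal sketch that assumes the broom package is installed; the model (`mpg ~ wt` on the built-in `mtcars` data) and the variable names are chosen here purely for illustration.

```r
library(broom)

# A simple linear model on a data set that ships with R.
fit <- lm(mpg ~ wt, data = mtcars)

coef_level  <- tidy(fit)     # component level: one row per model term
obs_level   <- augment(fit)  # observation level: one row per observation
model_level <- glance(fit)   # model level: a single row of summaries

print(coef_level)
print(head(obs_level))
print(model_level)
```

Because each result is an ordinary data frame, it can be passed directly to tools such as dplyr or ggplot2.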

Tuesday June 28, 2016 5:21pm - 5:39pm PDT

McCaw Hall

Data validation infrastructure: the validate package

**Moderators**
## Gabriela de Queiroz

Gabriela de Queiroz is a Sr. Engineering & Data Science Manager and a Sr. Developer Advocate at IBM where she leads the CODAIT Machine Learning Team. She works in different open source projects and is actively involved with several organizations to foster an inclusive community. She...

Data validation consists of checking whether data agrees with prior knowledge or assumptions about the process that generated the data, including collecting it. Such knowledge can often be expressed as a set of short statements, or rules, which the data must satisfy in order to be acceptable for further analyses.

Such rules may be of a technical nature or express domain knowledge. For example, domain knowledge rules include 'someone who is unemployed cannot have an employer' (labour force survey), 'the total profit and cost of an organization must add up to the total revenue' (business survey) and 'the price of a product in this period must lie within 20% of last year's price' (consumer price index data).

Data validation is an often recurring step in a multi-step data cleaning process where the progress of data quality is monitored throughout. For this reason, the validate package allows one to define data validation rules externally, confront them with data and gather and visualize results.

With the validate package, data validation rules become objects of computation that can be maintained, manipulated and investigated as separate entities. For example, it becomes possible to automatically detect contradictions in certain classes of rule sets. Maintenance is supported by import and export from and to free text or yaml files, allowing rules to be endowed with metadata.
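The labour force rule quoted above can be written as a rule object and confronted with data. This is a minimal sketch assuming the validate package is installed; the `survey` data frame, its column names and the rule name are hypothetical, and the rule is expressed as a plain logical implication.

```r
library(validate)

# Hypothetical labour force survey records.
survey <- data.frame(
  status   = c("employed", "unemployed", "unemployed"),
  employer = c("Acme", NA, "Initech"),
  stringsAsFactors = FALSE
)

# Rule: someone who is unemployed cannot have an employer.
rules <- validator(
  no_employer_if_unemployed = status != "unemployed" | is.na(employer)
)

# Confront the data with the rule set and summarise the outcome.
checked <- confront(survey, rules)
rule_summary <- summary(checked)
print(rule_summary)
```

Here the third record violates the rule, so the summary reports two passes and one fail.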

Sr. Developer Advocate/Manager, IBM

Tuesday June 28, 2016 5:21pm - 5:39pm PDT

Barnes & McDowell & Cranston

Interactive Naive Bayes using Shiny: Text Retrieval, Classification, Quantification

**Speakers**
## Giorgio Maria Di Nunzio

Interactive Machine Learning (IML) is a relatively new area of ML where focused interactions between algorithms and humans allow for faster and more accurate model updates than classical ML algorithms. By involving users directly in the process of optimizing the parameters of the ML model, it is possible to quickly improve the effectiveness of the model and also to understand why some parameter values work better than others through low-cost trial and error and experimentation with inputs and outputs. In this talk, we show three interactive applications developed with the Shiny package on the problems of text retrieval, text classification and text quantification. These applications implement a probabilistic model that uses the Naïve Bayes (NB) assumption, which has been widely recognised as a good trade-off between efficiency and efficacy but achieves satisfactory results only when optimized properly. All three applications provide a two-dimensional representation of probabilities inspired by the approach named Likelihood Spaces. This representation provides a data visualization suited to understanding how parameter and cost optimization affects the performance of the retrieval/classification/quantification application in a real machine learning setting on standard text collections. We will show that this particular geometrical interpretation of the probabilistic model, together with the interaction, significantly improves not only the performance but also the understanding of the models, and opens new perspectives for research studies.

Assistant Professor, Department of Information Engineering, University of Padua

Automated Text Classification
Naive Bayes
Cost Sensitive Learning
Interactive Machine Learning
Gamification for Machine Learning

Tuesday June 28, 2016 5:21pm - 5:39pm PDT

SIEPR 130

Introducing the permutations package

**Moderators**
## Thomas Petzoldt

A 'permutation' is a bijection from a finite set to itself. Permutations are important and interesting objects in a range of mathematical contexts including group theory, recreational mathematics, and the study of symmetry. This short talk will introduce the 'permutations' R package for manipulation and display of permutations. The package has been used for teaching pure mathematics, and contains a number of illustrative examples. The package is fully vectorized and is intended to provide R-centric functionality in the context of elementary group theory. The package includes functionality for working with the "megaminx", a dodecahedral puzzle similar in construction to the Rubik's cube; the megaminx puzzle is a pleasing application of group theory and the package was written specifically to analyze the megaminx. From a group-theoretic perspective, the center of the megaminx group comprises a single non-trivial element, the 'superflip'. The superflip has a distinctive and attractive appearance, and one computational challenge is to find the shortest sequence that accomplishes the superflip. Previously, the best known result was a superflip of 83 turns, due to Clarke. The presentation will conclude by showing one result of the permutations package: an 82-turn superflip.
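The basic group operations on permutations can be sketched in base R with plain integer vectors. This illustration does not use the permutations package or its API; the helper names `is_permutation`, `compose`, `inverse` and `perm_order` are hypothetical and defined here only for the example.

```r
# A permutation of {1, ..., n} in one-line ("word") form:
# p[i] is the image of i.
p <- c(2, 3, 1, 5, 4)  # the cycles (1 2 3)(4 5)
q <- c(1, 3, 2, 4, 5)  # the transposition (2 3)

# A vector is a permutation if it is a bijection on 1..n.
is_permutation <- function(p) identical(sort(as.integer(p)), seq_along(p))

# Composition "apply p first, then q": i is sent to q[p[i]].
compose <- function(p, q) q[p]

# Inverse: order(p)[k] is the index i with p[i] == k.
inverse <- function(p) order(p)

# Order of a permutation: the smallest k with p^k equal to the identity.
perm_order <- function(p) {
  id <- seq_along(p)
  current <- p
  k <- 1L
  while (!all(current == id)) {
    current <- p[current]
    k <- k + 1L
  }
  k
}

print(compose(p, inverse(p)))  # the identity permutation
print(perm_order(p))           # lcm of the cycle lengths 3 and 2
```

Representing permutations as vectors makes composition a single indexing operation, which is the kind of vectorized style the abstract alludes to.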

Senior Scientist, TU Dresden (Dresden University of Technology)

dynamic modelling, ecology, environmental statistics, aquatic ecosystems, antibiotic resistances, R packages: simecol, deSolve, FME, marelac, growthrates, shiny apps for teaching, object orientation

Tuesday June 28, 2016 5:21pm - 5:39pm PDT

SIEPR 120

The challenge of combining 176 x #otherpeoplesdata to create the Biomass And Allometry Database (BAAD)

Despite the hype around "big data", a more immediate problem facing many scientific analyses is that large-scale databases must be assembled from a collection of small independent and heterogeneous fragments -- the outputs of many and isolated scientific studies conducted around the globe. Together with 92 other co-authors, we recently published the Biomass And Allometry Database (BAAD) as a data paper in the journal Ecology, combining data from 176 different scientific studies into a single unified database. BAAD is unique in that the workflow -- from raw fragments to homogenised database -- is entirely open and reproducible. In this talk I introduce BAAD and illustrate solutions (using R) for some of the challenges of working with and distributing lots and lots of #otherpeople's data.

Tuesday June 28, 2016 5:21pm - 5:39pm PDT

Econ 140

Using Shiny modules to build more-complex and more-manageable apps

The release of Shiny 0.13 includes support for modules, allowing you to build Shiny apps more quickly and more reliably. Furthermore, using Shiny modules makes it easier for you to build more-complex apps, because the interior complexity of each module is hidden from the level of the app. This allows you, as a developer, to focus on the complexity of the app at the system level, rather than at the module level. For example, there are open-source Shiny modules that read a time-indexed CSV file and parse it into a dataframe, visualize a time-indexed dataframe using dygraphs, and write a dataframe to a CSV file to be downloaded. Modules are simply collections of functions that can be organized into, and called from, packages. The primary focus of this presentation will be on how modules from the "shinypod" package can be assembled into "simple" Shiny apps. There will also be demonstrations of more-complex apps built using modules. In this case, Shiny apps are built as interfaces to web services, allowing you to evaluate the usefulness of suites of web services without having to be immediately concerned with the API clients. Time permitting, there could be some discussion of how Shiny modules are put together.
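A module is typically a pair of functions, one for the UI and one for the server logic, sharing a namespace. The sketch below is a hypothetical counter module (`counterUI` and `counterServer` are names invented here and are not taken from shinypod). It uses `moduleServer()`, which is the interface in Shiny 1.5.0 and later; the Shiny 0.13 release described in the abstract used `callModule()` instead.

```r
library(shiny)

# Module UI: every input/output id is wrapped in the module's namespace.
counterUI <- function(id) {
  ns <- NS(id)
  tagList(
    actionButton(ns("increment"), "Add one"),
    textOutput(ns("count"))
  )
}

# Module server: the logic is isolated from the rest of the app.
counterServer <- function(id) {
  moduleServer(id, function(input, output, session) {
    # An action button's value is the number of times it has been clicked.
    output$count <- renderText(input$increment)
  })
}

# Two independent instances of the same module in one app.
ui <- fluidPage(counterUI("first"), counterUI("second"))
server <- function(input, output, session) {
  counterServer("first")
  counterServer("second")
}

app <- shinyApp(ui, server)  # start interactively with runApp(app)
```

Because each instance gets its own namespace ("first" and "second"), the two counters do not interfere with each other even though they share the same code.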

Tuesday June 28, 2016 5:21pm - 5:39pm PDT

Lane & Lyons & Lodato

A Future for R

A future is an abstraction for a value that may be available at some point in the future and whose state is either unresolved or resolved. When a future is resolved, the value is readily available. How and when futures are resolved is given by their evaluation strategies, e.g. synchronously in the current R session or asynchronously on a compute cluster or in background processes. Multiple asynchronous futures can be created without blocking the main process, providing a simple yet powerful construct for parallel processing. Only when the value of an unresolved future is needed does the main process block. We present the future package, which defines a unified API for using futures in R, either via explicit constructs `f <- future({ expr })` and `v <- value(f)` or via implicit assignments (promises) `v %<-% { expr }`. From these it is straightforward to construct classical `*apply()` mechanisms. The package implements synchronous eager and lazy futures as well as multiprocess (single-machine multicore and multisession) and cluster (multi-machine) futures. Additional future types can be implemented by extending the future package, e.g. BatchJobs and BiocParallel futures. We show that, because of the unified API and because global variables are automatically identified and exported, an R script that runs sequentially on the local machine can, with a single change of settings, run in, for instance, a distributed fashion on a remote cluster with values still being collected on the local machine. The future package is cross-platform and available on CRAN with source code on GitHub (https://github.com/HenrikBengtsson/future).
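The explicit and implicit constructs can be tried with a few lines, assuming the future package is installed. The sketch uses `plan(sequential)`, the name current releases of the package give to synchronous evaluation in the calling session (the abstract's "eager" strategy is the 2016 terminology); the expressions being evaluated are arbitrary examples.

```r
library(future)

# Evaluation strategy: resolve futures synchronously in this session.
# Swapping in plan(multisession) would use background R sessions instead,
# without changing any of the code below.
plan(sequential)

x <- 1:10  # a global variable, picked up automatically by the futures

# Explicit API: create a future, then ask for its value.
f <- future({ sum(x) })
v <- value(f)

# Implicit API: a future assignment; w is resolved when first used.
w %<-% { sum(x^2) }

print(v)
print(w)
```

The single `plan()` call is the "change of settings" the abstract refers to: the rest of the script is independent of where the futures are evaluated.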

Tuesday June 28, 2016 5:39pm - 5:57pm PDT

Econ 140

Differential equation-based models in R: An approach to simplicity and performance

**Moderators**
## Thomas Petzoldt

dynamic modelling, ecology, environmental statistics, aquatic ecosystems, antibiotic resistances, R packages: simecol, deSolve, FME, marelac, growthrates, shiny apps for teaching, object orientation

The world is a complex dynamical system, a system evolving in time and space in which numerous interactions and feedback loops produce phenomena that defy simple explanations. Differential-equation models are powerful tools to improve understanding of dynamic systems and to support forecasting and management in applied fields of mathematics, natural sciences, economics and business. While lots of effort has been put into the fundamental scientific tools, applying these to specific systems requires significant programming and re-implementation. The resulting code is often quite technical, hindering communication and maintenance. We present an approach to (1) make programming more generic, (2) generate code with high performance, (3) improve sustainability, and (4) support communication between modelers, programmers and users by:

- automatic generation of Fortran code (package rodeo) from spreadsheet tables containing state variables, parameters, processes, interactions and documentation,
- numerical solution with general-purpose solvers (package deSolve),
- web-based interfaces (package shiny) that can be designed manually or auto-generated from the model tables (package rodeoApp),
- creation of docs in LaTeX or HTML.

Package rodeo uses a stoichiometry-matrix notation (Petersen matrix) of reactive transport models and can generate R or Fortran code for ordinary and 1D partial differential equation models, e.g. with longitudinal or vertical structure. The suitability of the approach will be shown with two ecological models of different complexity: (1) antibiotic resistance gene transfer in the lab, (2) algae bloom control in a lake.
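The numerical solution step can be illustrated with a hand-written model, assuming the deSolve package is installed. The logistic growth equation, its parameter values and the variable names below are a hypothetical stand-in chosen for brevity; they are not one of the talk's ecological models, and no rodeo code generation is involved.

```r
library(deSolve)

# Derivative function in the form deSolve expects:
# it returns a list whose first element holds the derivatives.
logistic <- function(time, state, parms) {
  with(as.list(c(state, parms)), {
    dN <- r * N * (1 - N / K)  # logistic growth
    list(c(dN))
  })
}

state <- c(N = 1)             # initial population (hypothetical)
parms <- c(r = 0.5, K = 100)  # growth rate and carrying capacity
times <- seq(0, 50, by = 1)

# Solve with the default general-purpose solver.
out <- ode(y = state, times = times, func = logistic, parms = parms)
final_N <- out[nrow(out), "N"]

print(head(out))
print(final_N)
```

The population approaches the carrying capacity `K`, matching the analytical solution of the logistic equation.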

Senior Scientist, TU Dresden (Dresden University of Technology)

Tuesday June 28, 2016 5:39pm - 5:57pm PDT

SIEPR 120

R AnalyticFlow 3: Interactive Data Analysis GUI for R

**Speakers**
## Ryota Suzuki

R AnalyticFlow 3 is an open-source GUI for data analysis on top of R. It is designed to simplify the process of data analysis for both R experts and beginners. It is written in Java and runs on Windows, OS X and Linux. Interactive GUI modules are available to perform data analysis without writing code, or you can write R scripts if you prefer. You can then connect these modules (or scripts) to build an “analysis flow”, which is a workflow representing the processes of data analysis. An analysis flow can be executed by simple mouse operation, which facilitates collaborative work among people with different fields of expertise. R AnalyticFlow 3 is extensible: you can easily build custom GUI modules to add functions that you need. A custom module builder is available for this purpose, which itself is a simple, user-friendly GUI to design custom modules. Any R function, including your original R script, can be converted to a GUI module. It also provides typical tools such as a code editor, object/file browser, graphics device, help browser and R console. There are also many useful features including code completion, a debugger, object caching, auto-backup and a project manager. R AnalyticFlow 3 is freely available from our website (www.ef-prime.com). The source code is licensed under the LGPL, and works with other open-source libraries including JRI, JUNG and Substance.

CEO / Data Analyst, Ef-prime, Inc.

In my talk I will introduce R AnalyticFlow ( r.analyticflow.com ), an R-GUI for interactive data analysis. We are developing this software in Ef-prime, Inc. ( www.ef-prime.com ), a company that provides consulting and related services in data analysis. I'm also one of the authors...

Tuesday June 28, 2016 5:39pm - 5:57pm PDT

Lane & Lyons & Lodato

Rho: High Performance R

The Rho project (formerly known as CXXR) is working on transforming the current R interpreter into a high performance virtual machine for R. Using modern software engineering techniques and the research done on VMs for dynamic and array languages over the last twenty years, we are targeting a factor of ten speed improvement or better for most types of R code, while retaining full compatibility.

This talk will discuss the current compatibility and performance of the VM, the types of tasks it currently does well and outline the project's roadmap for the next year.

Tuesday June 28, 2016 5:39pm - 5:57pm PDT

McCaw Hall

RServer: Operationalizing R at Electronic Arts

**Moderators**
## Gabriela de Queiroz

Gabriela de Queiroz is a Sr. Engineering & Data Science Manager and a Sr. Developer Advocate at IBM where she leads the CODAIT Machine Learning Team. She works in different open source projects and is actively involved with several organizations to foster an inclusive community. She...

**Speakers**
## Ben Weber

The motivation for the RServer project is the ability for data scientists at Electronic Arts to offload R computations from their personal machines to the cloud and to enable modeling at scale. The outcome of the project is a web-based tool that our scientists can use to automate running R scripts on virtual machines and perform a variety of reporting and analysis tasks. We are using RServer to operationalize data science at EA.

The core of RServer is a Java application that leverages WAMP to provide a web front-end for managing deployed R scripts. At Electronic Arts, we host RServer instances on our on-demand infrastructure and can spin up new machines as necessary to support new products and analyses. In order to deploy scripts to the server, team members check in their R scripts and supporting files to Perforce and modify the server’s schedule file.

In addition to the web tool, we’ve developed an internal R package that provides functionality for connecting to our data sources, password management, and additional features for RServer, such as making R Markdown reports accessible via URLs. Some uses of RServer include running ETLs, creating and emailing R Markdown reports, and hosting dashboards and Shiny applications. This infrastructure enables us to power analytics in a way that is usually reserved for tools like Tableau, while utilizing the full power of R. This presentation will include a live demo of RServer. We are releasing an open-source version of RServer at useR! that supports Git.

Sr. Developer Advocate/Manager, IBM

Sr. Data Scientist, Electronic Arts

Senior Data Scientist at Electronic Arts. Studied game AI and machine learning at UC Santa Cruz. @bgweber

Tuesday June 28, 2016 5:39pm - 5:57pm PDT

Barnes & McDowell & Cranston

xgboost: An R package for Fast and Accurate Gradient Boosting

XGBoost is a multi-language library designed and optimized for tree boosting algorithms. The underlying algorithm of xgboost is an extension of the classic gradient boosting machine algorithm. By employing multithreading and imposing regularization, xgboost is able to utilize more computational power and produce more accurate predictions than the traditional version. Moreover, a friendly user interface and comprehensive documentation are provided for user convenience. The package has been downloaded from CRAN more than 4,000 times per month on average, and the number is growing rapidly. It is now widely applied in both industry and academic research. The R package has won the 2016 John M. Chambers Statistical Software Award. From the very beginning of the work, our goal has been to make a package which brings convenience and joy to its users. In this talk, I will briefly introduce the usage of xgboost, as well as several highlights that we think users would love to know.

Tuesday June 28, 2016 5:39pm - 5:57pm PDT

SIEPR 130

Applying R in streaming and Business Intelligence applications

**Moderators**
## Gabriela de Queiroz

Gabriela de Queiroz is a Sr. Engineering & Data Science Manager and a Sr. Developer Advocate at IBM where she leads the CODAIT Machine Learning Team. She works in different open source projects and is actively involved with several organizations to foster an inclusive community. She...

R provides tremendous value to statisticians and data scientists. However, they are often challenged to integrate their work and extend that value to the rest of their organization. This presentation will demonstrate how the R language can be used in Business Intelligence applications (such as Financial Planning and Budgeting, Marketing Analysis, and Sales Forecasting) to put advanced analytics into the hands of a wider pool of decision makers. We will also show how R can be used in streaming applications (such as TIBCO Streambase) to rapidly build, deploy and iterate predictive models for real-time decisions. TIBCO's enterprise platform for the R language, TIBCO Enterprise Runtime for R (TERR), will be discussed, and examples will include fraud detection, marketing upsell and predictive maintenance.

Sr. Developer Advocate/Manager, IBM

Tuesday June 28, 2016 5:57pm - 6:15pm PDT

Barnes & McDowell & Cranston

Colour schemes in data visualisation: Bias and Precision

The technique of mapping continuous values to a sequence of colours is often used to visualise quantitative data. The ability of different colour schemes to facilitate data interpretation has not been thoroughly tested. Using a survey framework built with Shiny and loggr, we compared six commonly used colour schemes in two experiments, a measure of perceptual linearity and a map reading task, for: (1) bias and precision in data interpretation, (2) response time and (3) colour preferences. The single-hue schemes were unbiased (perceived values did not consistently deviate from the true value) but very imprecise (large variance among the perceived values). Schemes with hue transitions improved precision; however, they were highly biased when not close to perceptual linearity (especially the multi-hue ‘rainbow’ schemes). Response time was shorter for the single-hue schemes and longer for more complex colour schemes. There was no aesthetic preference for any of the colourful schemes. These results show that in choosing a colour scheme to communicate quantitative information, there are two potential pitfalls: bias and imprecision. Anyone using colour to represent data should be aware of the bias-precision trade-off and select the scheme that balances these two potential communication errors.

Tuesday June 28, 2016 5:57pm - 6:15pm PDT

McCaw Hall

Helping R Stay in the Lead by Deploying Models with PFA

**Moderators**
## Thomas Petzoldt

dynamic modelling, ecology, environmental statistics, aquatic ecosystems, antibiotic resistances, R packages: simecol, deSolve, FME, marelac, growthrates, shiny apps for teaching, object orientation

**Speakers**
## Stuart Bailey

We introduce a new language for deploying analytic models into products, services and operational systems called the Portable Format for Analytics (PFA). PFA is an example of what is sometimes called a model interchange format, a standard and domain specific language for describing analytic models that is independent of specific tools, applications or systems. Model interchange formats allow one application (the model producer) to export models and another application (the model consumer or scoring engine) to import models. The core idea behind PFA is to support the safe execution of statistical functions, mathematical functions, and machine learning algorithms and their compositions within a safe execution environment. With this approach, the common analytic models used in data science can be implemented, as well as the data transformations and data aggregations required for pre- and post-processing data. We will discuss the deployment of models developed in R using PFA, why PFA is strategically important for the R community, and the current state of R libraries for PFA exporting and manipulation of models developed in R.

Senior Scientist, TU Dresden (Dresden University of Technology)

CTO, Open Data Group

I'm a technologist and entrepreneur who has been focused on analytic and data intensive distributed systems for over two decades. At Open Data Group we are focused on improving processes and technologies for deploying analytics such as those created with R. We refer to the capabilities...

Tuesday June 28, 2016 5:57pm - 6:15pm PDT

SIEPR 120

Rectools: An Advanced Recommender System

Recommendation engines have a number of different applications. From books to movies, they enable the analysis and prediction of consumer preferences. The prevalence of recommender systems in both the business and computational world has led to clear advances in prediction models over the past years. Current R packages include recosystem and recommenderlab. However, our new package, rectools, currently under development, extends its capabilities in several directions. One of the most important differences is that rectools allows users to incorporate covariates, such as age and gender, to improve predictive ability and better understand consumer behavior. Our software incorporates a number of different methods, such as non-negative matrix factorization, random effects models, and nearest neighbor methods. In addition to our incorporation of covariate capabilities, rectools also integrates several kinds of parallel computation. Examples of real data will be presented, and results of computational speedup experiments will be reported; results so far have been very encouraging. Code is being made available on GitHub, at https://github.com/Pooja-Rajkumar/rectools.

Tuesday June 28, 2016 5:57pm - 6:15pm PDT

SIEPR 130

Rethinking R Documentation: an extension of the lint package

In this talk I present an extension to the lint package that assists with the documentation of R objects. R is the de facto standard for literate programming thanks to the packages available for it. However, R still falls behind competing languages in the area of documentation. In Steve McConnell's classic Code Complete (2004), the first principle of commenting routines is "keep comments close to the code they describe." The native documentation system for R requires separate files. Packages have been developed that improve the situation; however, unlike Doxygen, on which they are based, they do not allow code and documentation to be fully mixed. I propose a paradigm shift for R documentation, which I have implemented in the R package lint. This strategy allows for several subtle changes in documentation, while seeking to preserve as much previous capability as is reasonable. The first change is to store documentation as an R object itself, allowing documentation to be dynamically generated and manipulated in code. Documentation can also be kept as an attribute of the function or object that it documents and can exist independently of a package. The second change is that the documentation engine makes full use of the R parser. This integrates code with documentation comments and allows meaning to be tailored to location. These extensions give more capability to programmers and users of R, easing the burden of creating documentation. I welcome comments and discussion on the strategy of documentation and the direction of implementation.

Tuesday June 28, 2016 5:57pm - 6:15pm PDT

Econ 140

Visual Pruner: A Shiny app for cohort selection in observational studies

**Moderators**

**Speakers**

Observational studies are a widely used and challenging class of studies. A key challenge is selecting a study cohort from the available data, or "pruning" the data, in a way that produces both sufficient balance in pre-treatment covariates and an easily described cohort from which results can be generalized. Although many techniques for pruning exist, it can be difficult for analysts using these methods to see how the cohort is being selected. Consequently, these methods are underutilized in research. Visual Pruner is a free, easy-to-use Shiny web application that can improve both the credibility and the transparency of observational studies by letting analysts use updatable linked visual displays of estimated propensity scores and important baseline covariates to refine inclusion criteria. By helping researchers see how the pre-treatment covariate distributions in their data relate to the estimated probabilities of treatment assignment (propensity scores), the app lets researchers make pruning decisions based on covariate patterns that are otherwise hard to discover. The app yields a set of inclusion criteria that can be used in conjunction with further statistical analysis in R or any other statistical software. While the app is interactive and allows iterative decision-making, it can also easily be incorporated into a reproducible research workflow. Visual Pruner is currently hosted by the Vanderbilt Department of Biostatistics and can also be run locally within R or RStudio. For links and additional resources, see http://biostat.mc.vanderbilt.edu/VisualPruner.

Tuesday June 28, 2016 5:57pm - 6:15pm PDT

Lane & Lyons & Lodato

Welcome Reception

The welcome reception, sponsored by RStudio, will take place at the Bing Concert Hall on Stanford campus.

See social program for additional details.

Tuesday June 28, 2016 6:30pm - 8:30pm PDT

Bing Concert Hall

Towards a grammar of interactive graphics

**Moderators**
## Karthik Ram

**Speakers**

I announced ggvis in 2014, but there has been little progress on it since. In this talk, I'll tell you a little bit about what I've been working on instead (data ingest, purrr, multiple models, ...) and tell you my plans for the future of ggvis. The goal is for 2016 to be the year of ggvis, and I'm going to be putting a lot of time into ggvis until it's a clear replacement for ggplot2. I'll talk about some of the new packages that will make this possible (including ggstat, ggeom, and gglayout), and how this work is also going to improve ggplot2.

co-founder, rOpenSci

Karthik Ram is a co-founder of ROpenSci, and a data science fellow at the University of California's Berkeley Institute for Data Science. Karthik primarily works on a project that develops R-based tools to facilitate open science and access to open data.

Wednesday June 29, 2016 9:00am - 10:00am PDT

McCaw Hall

Empowering Business Users with Shiny

**Moderators**
## Joseph Rickert

**Speakers**

Relationships between data scientists and business users can often be very transactional in nature (i.e. give us some data and we’ll give you a solution). This approach to analytics can produce meaningful results but removes business users from the analytical process, which often hinders adoption and prevents user insight from enhancing the analysis. Shiny is a powerful tool that can be used to create compelling output from analytic work but it can also be used to cultivate interactive feedback loops between data scientists and business users. These feedback loops help ensure that data scientists are answering the right questions and that business users are given the opportunity to invest themselves in the analysis, which often expedites the execution and adoption of the data science work. The iterative development of these Shiny applications also works well within the agile framework that is becoming common for data science projects. In this talk we will discuss some examples of how Shiny has been used at Allstate to empower business users and create an organizational appetite for data science.

Program Manager, Microsoft

Joseph is a Program Manager at Microsoft having come to Microsoft with the acquisition of Revolution Analytics. He is a data scientist and R language evangelist passionate about analyzing data and teaching people about R. He is a regular contributor to the Revolutions blog and an... Read More →

Wednesday June 29, 2016 10:30am - 10:35am PDT

SIEPR 120

Importing modern data into R

**Moderators**

**Speakers**
## Javier Luraschi

This talk explores modern trends in data storage formats and the tools, packages and best practices to import this data into R. We will start with a quick recap of the existing tools and packages for importing data into R: readr, readxl, haven, jsonlite, xml2, odbc and jdbc. Afterwards, we will discuss modern data formats and the emerging tools we can use today. We will explore sparkr, mongolite and the role of specialized packages like fitbitScraper and getSymbols. This talk will wrap up by assessing gaps and exploring future trends in this space.

Software Engineer, RStudio

Javier is the author of “Mastering Spark with R”, sparklyr, mlflow, pins and many other R packages for deep learning and data science. He holds a double degree in Math and Software Engineer and decades of industry experience with a focus on data analysis. He currently works in... Read More →

Wednesday June 29, 2016 10:30am - 10:48am PDT

SIEPR 130

Notebooks with R Markdown

**Moderators**
## Karthik Ram

Karthik Ram is a co-founder of ROpenSci, and a data science fellow at the University of California's Berkeley Institute for Data Science. Karthik primarily works on a project that develops R-based tools to facilitate open science and access to open data.

**Speakers**

Notebook interfaces for data analysis have compelling advantages including the close association of code and output and the ability to intersperse narrative with computation. Notebooks are also an excellent tool for teaching and a convenient way to share analyses. As an authoring format, R Markdown bears many similarities to traditional notebooks like Jupyter and Beaker, but it has some important differences. R Markdown documents use a plain-text representation (markdown with embedded R code chunks) which creates a clean separation between source code and output, is editable with the same tools as for R scripts (.Rmd modes are available for Emacs, Vim, Sublime, Eclipse, and RStudio), and works well with version control. R Markdown also features a system of extensible output formats that enable reproducible creation of production-quality output in many formats including HTML, PDF, Word, ODT, HTML5 slides, Beamer, LaTeX-based journal articles, websites, dashboards, and even full length books. In this talk we'll describe a new notebook interface for R that works seamlessly with existing R Markdown documents and displays output inline within the standard RStudio .Rmd editing mode. Notebooks can be published using the traditional Knit to HTML or PDF workflow, and can also be shared with a compound file that includes both code and output, enabling readers to easily modify and re-execute the code. Building a notebook system on top of R Markdown carries forward its benefits (plain text, reproducible workflow, and production quality output) while enabling a richer, more literate workflow for data analysis.

co-founder, rOpenSci

Wednesday June 29, 2016 10:30am - 10:48am PDT

McCaw Hall

OPERA: Online Prediction by ExpeRts Aggregation

**Moderators**

**Speakers**

We present an R package for prediction of time series based on online robust aggregation of a finite set of forecasts (machine learning method, statistical model, physical model, human expertise, ...). More formally, we consider a sequence of observations y(1), …, y(t), to be predicted element by element. At each time instance t, a finite set of experts provides predictions x(k,t) of the next observation y(t). Several methods are implemented to combine these expert forecasts according to their past performance (several loss functions are implemented to measure it). These combining methods satisfy robust finite-time theoretical performance guarantees. We demonstrate on different examples from energy markets (electricity demand, electricity prices, solar and wind power time series) the interest of this approach both in terms of forecasting performance and time series analysis.
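To make the aggregation idea concrete, here is a minimal sketch in Python of one standard combining rule, the exponentially weighted average forecaster under square loss. It is an illustration only and not the package's interface: the function name `ewa_forecast`, the learning rate `eta`, and the toy data are all assumptions introduced here.

```python
import math

def ewa_forecast(expert_preds, observations, eta=1.0):
    """Exponentially weighted average of expert forecasts (square loss).

    expert_preds[t][k] plays the role of x(k,t); observations[t] is y(t).
    Returns the aggregated forecasts and the weights used for the last one.
    """
    n_experts = len(expert_preds[0])
    cum_loss = [0.0] * n_experts              # past loss of each expert
    weights = [1.0 / n_experts] * n_experts
    forecasts = []
    for preds, y in zip(expert_preds, observations):
        # Weights use only losses observed strictly before time t.
        best = min(cum_loss)                  # shift for numerical stability
        raw = [math.exp(-eta * (loss - best)) for loss in cum_loss]
        total = sum(raw)
        weights = [r / total for r in raw]
        forecasts.append(sum(w * p for w, p in zip(weights, preds)))
        # y(t) is revealed; update every expert's cumulative square loss.
        cum_loss = [loss + (p - y) ** 2 for loss, p in zip(cum_loss, preds)]
    return forecasts, weights

# Hypothetical data: expert 0 is exact, expert 1 is biased by +2.
y = [1.0, 2.0, 3.0, 4.0]
x = [[v, v + 2.0] for v in y]
forecasts, weights = ewa_forecast(x, y, eta=1.0)
```

The first forecast is the plain average of the two experts; after that the weight shifts quickly toward the expert with the smaller cumulative loss.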

Wednesday June 29, 2016 10:30am - 10:48am PDT

Barnes & McDowell & Cranston

Predicting individual treatment effects

**Moderators**
## Susan Holmes

**Speakers**

Treatments for complicated diseases often help some patients but not all and predicting the treatment effect of new patients is important in order to make sure every patient gets the best possible treatment. We propose model-based random forests as a method to detect similarities between patients with respect to their treatment effect and on this basis compute personalized models for new patients to obtain their individual treatment effect. The whole procedure focuses on a base model which usually contains the treatment indicator as a single covariate and takes the survival time or a health or treatment success measurement as primary outcome. This base model is used to grow the model-based trees within the forest as well as to compute the personalized models, where the similarity measurements enter as weights. We show how personalized models can be set up using the cforest() and predict.cforest() functions from the "partykit" package in combination with regression models such as glm() ("stats") or survreg() ("survival"). We apply the methods to patients suffering from Amyotrophic Lateral Sclerosis (ALS). The data are publicly available from https://nctu.partners.org/ProACT and data preprocessing can be done with the R package "TH.data". The treatment of interest is the drug Riluzole which is the only approved drug against ALS but merely shows minor benefit for patients. The personalized models suggest that some patients benefit more from the drug than others.

Professor, Statistics, Stanford

I like teaching nonparametric multivariate analyses to biologists.
Reproducible research is really important to me and I make all my work available online, mostly as Rmd files. I still like to code, use Github and shiny as well as Bioconductor. I am trying to finish a book for biologists... Read More →

Wednesday June 29, 2016 10:30am - 10:48am PDT

Lane & Lyons & Lodato

Profvis: Profiling tools for faster R code

As programming languages go, R has a bit of a reputation for being slow. This reputation is mostly undeserved, and it hinges on the fact that R's copy-on-modify semantics make its performance characteristics different from those of many other languages. That said, even the most expert R programmers often write code that could be faster. The first step to making code faster is to find which parts are slow. This isn't an easy task. Sometimes we have no idea what parts of code are expensive, and even when we do have intuitions about it, those intuitions can be wrong. After the slow parts of code have been identified, one can move on to the next step: speeding up that code. In this talk I'll show how to profile and optimize code using profvis, a new package for exploring profiling data. Profvis provides a graphical interface that makes it easy to spot which pieces of code are expensive. I will also discuss why some common operations in R may be surprisingly slow, and how they can be sped up.

Wednesday June 29, 2016 10:30am - 10:48am PDT

Econ 140

Bespoke eStyle Statistical Training for Africa: challenges and opportunities of developing an online course

**Moderators**
## Joseph Rickert

**Speakers**

‘BeST’, an online course for African scientists and early career researchers, was developed to provide support for experimental design principles and the use of R software. It is available at yieldingresults.org and is supported by the Australian Centre for International Agricultural Research (ACIAR). A team of developers produced materials with an emphasis on visual and practical content. The site is continuously available, is in modular format, and aims to assist in developing designs and following through with analysis and reporting. The early evaluation by clients and students will be presented. Options and challenges for future support and collaboration on the site will be discussed.

Program Manager, Microsoft

Joseph is a Program Manager at Microsoft having come to Microsoft with the acquisition of Revolution Analytics. He is a data scientist and R language evangelist passionate about analyzing data and teaching people about R. He is a regular contributor to the Revolutions blog and an... Read More →

Wednesday June 29, 2016 10:35am - 10:40am PDT

SIEPR 120

Tie-ins between R and Openstreetmap data

**Moderators**
## Joseph Rickert

Joseph is a Program Manager at Microsoft having come to Microsoft with the acquisition of Revolution Analytics. He is a data scientist and R language evangelist passionate about analyzing data and teaching people about R. He is a regular contributor to the Revolutions blog and an... Read More →

**Speakers**
## Jan-Philipp Kolb

An abundance of information emerged through collaborative mapping as a consequence of the development of Openstreetmap (OSM) in 2004. Currently, all kinds of R-packages are available to deal with different types of spatial data. But getting the OSM data into the R-environment can still be challenging, especially for users who are new to R. One way to access information is the usage of application programming interfaces (APIs) like the Overpass API.

In this presentation, I will focus on the possibilities to access, assess and process OSM-data with R. Therefore, I will provide the tie-ins of the R-language and OSM. Since XML-protocols are often used to describe spatial information, we will use the package geosmdata, which is a wrapper to transfer such information to R-dataframes, using the XML-package. Furthermore, I will showcase the importance of such connections via a brief case-study related to social sciences.

Program Manager, Microsoft

Senior Researcher, Gesis Leibniz Institute for the social sciences

Data Science, spatial data and the usage in social sciences

Wednesday June 29, 2016 10:40am - 10:45am PDT

SIEPR 120

MAVIS: Meta Analysis via Shiny

**Moderators**
## Joseph Rickert

Joseph is a Program Manager at Microsoft having come to Microsoft with the acquisition of Revolution Analytics. He is a data scientist and R language evangelist passionate about analyzing data and teaching people about R. He is a regular contributor to the Revolutions blog and an... Read More →

**Speakers**
## W. Kyle Hamilton

We present a Shiny (RStudio, Inc., 2014) web application and R (R Core Team, 2015) package to simplify the process of running a meta-analysis using a variety of packages from the R community, including the popular metafor package (Viechtbauer, 2010). MAVIS (Hamilton, Aydin, and Mizumoto, 2014) was created to be used as a teaching tool for students and for scientists looking to run their own meta-analysis. Currently MAVIS supports both fixed and random effects models, methods for detecting publication bias, effect size calculators, single case design support, and generation of publication-grade graphics. With this application we’ve created an open source, browser-based graphical user interface (GUI) which has lowered the barrier of entry for novice and occasional users.

Program Manager, Microsoft

Graduate Student, University of California, Merced

Kyle is a graduate student in the health psychology program at the University of California, Merced. His research interests include emotion regulation, electronic cigarettes, agent-based modeling, and human-computer interaction.
Kyle has written a few Shiny applications which can... Read More →

Wednesday June 29, 2016 10:45am - 10:50am PDT

SIEPR 120

ETL for medium data

Packages provide users with software that extends the core functionality of R, as well as data that illustrates the use of that functionality. However, by design the type of data that can be contained in an R package on CRAN is limited. First, packages are designed to be small, so that the amount of data stored in a package is supposed to be less than 5 megabytes. Furthermore, these data are static, in that CRAN allows only monthly releases. Alternative package repositories -- such as GitHub -- are also limited in their ability to store and deliver data that could be changing in real-time to R users. The etl package provides a CRAN-friendly framework that allows R users to work with medium data in a responsible and responsive manner. It leverages the dplyr package to facilitate Extract-Transform-Load (ETL) operations that bring real-time data into local or remote databases controllable by R users who may have little or no SQL experience. The suite of etl-dependent packages brings the world of medium data -- too big to store in memory, but not so big that it won't fit on a hard drive -- to a much wider audience.

Wednesday June 29, 2016 10:48am - 11:06am PDT

SIEPR 130

Meta-Analysis of Epidemiological Dose-Response Studies with the dosresmeta R package

**Moderators**
## Susan Holmes

**Speakers**

Quantitative exposures (e.g. smoking, alcohol consumption) in predicting binary health outcomes (e.g. mortality, incidence of a disease) are frequently categorized and modeled with indicator variables. Results are expressed as relative risks for the levels of exposure using one category as referent. Dose-response meta-analysis is an increasingly popular statistical technique that aims to estimate and characterize an overall functional relation from such aggregated data. A common approach is to contrast the outcome risk in the highest exposure category relative to the lowest. A dose-response approach is more robust since it takes into account the quantitative values associated with the exposure categories. It provides a detailed description of how the risk varies throughout the observed range of exposure. Additionally, since all the exposure categories contribute to determining the overall relation, estimation is more efficient. Our aim is to give a short introduction to the methodological framework (structure of aggregated data, covariance of correlated outcomes, estimation and pooling of individual curves). We describe how to test hypotheses and how to quantify statistical heterogeneity. Alternative tools to flexibly model the quantitative exposure will be presented (splines and polynomials). We will illustrate modelling techniques and presentation of (graphical and tabular) results using the dosresmeta R package.

Professor, Statistics, Stanford

I like teaching nonparametric multivariate analyses to biologists.
Reproducible research is really important to me and I make all my work available online, mostly as Rmd files. I still like to code, use Github and shiny as well as Bioconductor. I am trying to finish a book for biologists... Read More →

Wednesday June 29, 2016 10:48am - 11:06am PDT

Lane & Lyons & Lodato

permuter: An R package for randomization inference

**Moderators**

**Speakers**

Software packages for randomization inference are few and far between. This forces researchers either to rely on specialized stand-alone programs or to use classical statistical tests that may require implausible assumptions about their data-generating process. The absence of a flexible and comprehensive package for randomization inference is an obstacle for researchers from a wide range of disciplines who turn to R as a language for carrying out their data analysis. We present permuter, a package for randomization inference. We illustrate the program's capabilities with several examples:

- a randomized experiment comparing the student evaluations of teaching for male and female instructors (MacNell et al., 2014)

- a study of the association between salt consumption and mortality at the level of nations

- an assessment of inter-rater reliability for a series of labels assigned by multiple raters to video footage of children on the autism spectrum

We discuss future plans for permuter and the role of software development in statistics.
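As a rough illustration of what randomization inference involves, the following Python sketch runs a two-sample permutation test on the difference in means. It is not the permuter interface: the function name `permutation_test`, its parameters, and the two small groups of scores are hypothetical and were made up for this example.

```python
import random

def permutation_test(group_a, group_b, n_perm=10000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means.

    Returns the observed difference and an estimated p-value.
    """
    def mean(values):
        return sum(values) / len(values)

    rng = random.Random(seed)                 # fixed seed for repeatability
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        # Under the null hypothesis the group labels are exchangeable,
        # so reshuffle them and recompute the statistic.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed) - 1e-12:  # tolerance for float ties
            extreme += 1
    # Count the observed labelling itself so the estimate is never zero.
    return observed, (extreme + 1) / (n_perm + 1)

# Hypothetical scores for two groups of five units each.
group_a = [4.1, 3.8, 4.5, 4.0, 4.3]
group_b = [3.2, 3.0, 3.6, 3.1, 3.4]
observed, p_value = permutation_test(group_a, group_b)
```

No distributional assumption is needed here: the reference distribution comes entirely from relabelling the observed data.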

Wednesday June 29, 2016 10:48am - 11:06am PDT

Barnes & McDowell & Cranston

Size of Datasets for Analytics and Implications for R

**Moderators**

**Speakers**
## Szilard Pafka

With so much hype about "big data" and the industry pushing for distributed computing vs traditional single-machine tools, one wonders about the future of R. In this talk I will argue that most data analysts/data scientists don't actually work with big data the majority of the time, and that using immature "big data" tools is therefore counterproductive. I will show that, contrary to widespread beliefs, the increase in dataset sizes used for analytics has actually been outpaced in the last 10 years by the increase in memory (RAM), making the use of single-machine tools ever more attractive. Furthermore, base R and several widely used R packages have undergone significant performance improvements (I will present benchmarks to quantify this), making R the ideal tool for data analysis on even relatively large datasets. In particular, R has access (via CRAN packages) to excellent high-performance machine learning libraries (benchmarks will be presented), while high-performance and parallel computing facilities have been part of the R ecosystem for many years. Nevertheless, the R community shall of course continue pushing the boundaries and extend R with new and ever more performant features.

Chief Data Scientist, Epoch

Szilard studied Physics in the 90s and has obtained a PhD by using statistical methods to investigate the risk of financial portfolios. Next he has worked in a bank quantifying and managing market risk. About a decade ago he moved to California to become the Chief Scientist of a credit... Read More →

Wednesday June 29, 2016 10:48am - 11:06am PDT

Econ 140

Visualizing Simultaneous Linear Equations, Geometric Vectors, and Least-Squares Regression with the matlib Package for R

**Moderators**
## Karthik Ram

Karthik Ram is a co-founder of ROpenSci, and a data science fellow at the University of California's Berkeley Institute for Data Science. Karthik primarily works on a project that develops R-based tools to facilitate open science and access to open data.

**Speakers**

The aim of the matlib package is pedagogical --- to help teach concepts in linear algebra, matrix algebra, and vector geometry that are useful in statistics. To this end, the package includes various functions for numerical linear algebra, most of which duplicate capabilities available elsewhere in R, but which are programmed transparently and purely in R code, including functions for solving possibly over- or under-determined linear simultaneous equations, for computing ordinary and generalized matrix inverses, and for producing various matrix decompositions. Many of these methods are implemented via Gaussian elimination. This paper focuses on the visualization facilities in the matlib package, including for graphing the solution of linear simultaneous equations in 2 and 3 dimensions; for demonstrating vector geometry in 2 and 3 dimensions; and for displaying the vector geometry of least-squares regression. We illustrate how these visualizations help to communicate fundamental ideas in linear algebra, vector geometry, and statistics. The 3D visualizations are implemented using the rgl package.

co-founder, rOpenSci

Wednesday June 29, 2016 10:48am - 11:06am PDT

McCaw Hall

madness: multivariate automatic differentiation in R

**Moderators**
## Joseph Rickert

Joseph is a Program Manager at Microsoft having come to Microsoft with the acquisition of Revolution Analytics. He is a data scientist and R language evangelist passionate about analyzing data and teaching people about R. He is a regular contributor to the Revolutions blog and an... Read More →

**Speakers**

The madness package provides a class for automatic differentiation of 'multivariate' operations via forward accumulation. 'Multivariate' means the class computes the derivative of a vector or matrix or multidimensional array (or scalar) with respect to a scalar, vector, matrix, or multidimensional array. The primary intended use of this class is to support the multivariate delta method for performing inference on multidimensional quantities.
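Forward accumulation can be sketched in a few lines of Python: each quantity carries its value together with its partial derivatives with respect to the inputs, and every operation updates both. The `Dual` class and `jacobian` helper below are hypothetical names used for illustration only; they support just addition and multiplication and do not reflect how the madness class is implemented.

```python
class Dual:
    """A value together with its gradient with respect to the inputs."""

    def __init__(self, value, grad):
        self.value = value
        self.grad = list(grad)

    @staticmethod
    def _lift(other, n):
        # Treat plain numbers as constants with a zero gradient.
        return other if isinstance(other, Dual) else Dual(other, [0.0] * n)

    def __add__(self, other):
        other = Dual._lift(other, len(self.grad))
        return Dual(self.value + other.value,
                    [a + b for a, b in zip(self.grad, other.grad)])

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual._lift(other, len(self.grad))
        # Product rule: d(uv) = u dv + v du.
        return Dual(self.value * other.value,
                    [self.value * b + other.value * a
                     for a, b in zip(self.grad, other.grad)])

    __rmul__ = __mul__


def jacobian(f, x):
    """Evaluate f at x and return (values, Jacobian) by forward accumulation."""
    n = len(x)
    # Seed input i with the i-th unit vector as its gradient.
    inputs = [Dual(xi, [1.0 if i == j else 0.0 for j in range(n)])
              for i, xi in enumerate(x)]
    outputs = f(inputs)
    return [o.value for o in outputs], [o.grad for o in outputs]


# f(x0, x1) = (x0 * x1, x0 + 3 * x1^2), evaluated at (2, 5).
values, jac = jacobian(lambda v: [v[0] * v[1], v[0] + 3.0 * v[1] * v[1]],
                       [2.0, 5.0])
# values is [10.0, 77.0]; jac is [[5.0, 2.0], [1.0, 30.0]]
```

A Jacobian obtained this way is the ingredient the delta method needs to carry the covariance of an estimate through to a function of it.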

Program Manager, Microsoft

Wednesday June 29, 2016 10:50am - 10:55am PDT

SIEPR 120

rempreq: An R package for Estimating the Employment Impact of U.S. Domestic Industry Production and Imports

**Moderators**
## Joseph Rickert

Joseph is a Program Manager at Microsoft having come to Microsoft with the acquisition of Revolution Analytics. He is a data scientist and R language evangelist passionate about analyzing data and teaching people about R. He is a regular contributor to the Revolutions blog and an... Read More →

**Speakers**

The impact of imports and technological change on domestic employment is a long-term and ongoing topic for academic and government research, and discussion in the popular media[1][2].

The U.S. Bureau of Labor Statistics (BLS) publishes a current and historical Employment Requirements Matrix (ERM), which details the employment generated directly and indirectly across all industries by a million dollars of production of a given industry's primary product[3]. The BLS data can give an indication of the relative impact of different industries' primary production, and is broken down by years (1997-2014), over 200 sectors, and by domestic-only versus total production including imports. The ERM is often used in research as a component of general Leontief Input-Output models, and external sources of economic data for final demand.[4] R, with its support for input-output modelling and general matrix operations, is well-suited to research in this area[5][6].

The package rempreq includes both the current and historic tables, for total production and domestic production only. It also includes functions for accessing any particular year (or years), selecting industries, and for including domestic versus total output. These can be used to conveniently generate estimated time series for the employment impact for various types of production, the impact of imports on employment, and investigate changes in the technological structure of industries related to employment in those industries over time.

This presentation will include an introduction to rempreq, sample demonstrations of its use, and future plans for the extension of the package.
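The following Python sketch shows, on a toy two-sector economy, how an employment requirements matrix can be derived in a Leontief input-output setting: jobs per million dollars of output in each industry are multiplied by the Leontief inverse. All coefficients are invented for illustration and none come from BLS data; the function names are hypothetical and are not part of rempreq.

```python
def leontief_inverse_2x2(a):
    """Return (I - A)^-1 for a 2x2 matrix A of technical coefficients."""
    m00, m01 = 1.0 - a[0][0], -a[0][1]
    m10, m11 = -a[1][0], 1.0 - a[1][1]
    det = m00 * m11 - m01 * m10
    return [[m11 / det, -m01 / det],
            [-m10 / det, m00 / det]]


def employment_requirements(a, jobs_per_million):
    """Entry [i][j]: jobs in industry i supported, directly and indirectly,
    by one million dollars of final demand for industry j's product."""
    total_req = leontief_inverse_2x2(a)
    return [[jobs_per_million[i] * total_req[i][j] for j in range(2)]
            for i in range(2)]


# Made-up inputs: a[i][j] is the input from industry i needed per dollar
# of industry j's output; jobs[i] is jobs per million dollars of output.
a = [[0.2, 0.3],
     [0.1, 0.4]]
jobs = [5.0, 8.0]

erm = employment_requirements(a, jobs)
# Column sums give total jobs per million dollars of final demand.
totals = [erm[0][j] + erm[1][j] for j in range(2)]
```

Multiplying such a matrix by a vector of final demand, in millions of dollars, gives an estimate of the employment that demand supports in each industry.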

Program Manager, Microsoft

Wednesday June 29, 2016 10:55am - 11:00am PDT

SIEPR 120

Text Mining and Sentiment Extraction in Central Bank Documents

**Moderators**
## Joseph Rickert

Joseph is a Program Manager at Microsoft having come to Microsoft with the acquisition of Revolution Analytics. He is a data scientist and R language evangelist passionate about analyzing data and teaching people about R. He is a regular contributor to the Revolutions blog and an... Read More →

**Speakers**

The deep transformation induced by the World Wide Web (WWW) revolution has thoroughly impacted a relevant part of the social interactions in our present global society. The huge amount of unstructured information available on blogs, forums and public institution web sites puts forward different challenges and opportunities. Starting from these considerations, in this paper we pursue a two-fold goal. Firstly we review some of the main methodologies employed in text mining and for the extraction of sentiment and emotions from textual sources. Secondly we provide an empirical application by considering the latest 20 issues of the Bank of Italy Governor’s concluding remarks from 1996 to 2015. By taking advantage of the open source software package R, we show the following:

1. checking the word frequency distribution features of the documents;
2. extracting the evolution of the sentiment and the polarity orientation in the texts;
3. evaluating the evolution of an index for the readability and the formality level of the texts;
4. attempting to measure the popularity gained from the documents in the web.

The results of the empirical analysis show the feasibility of extracting the main topics from the considered corpus. Moreover it is shown how to check for positive and negative terms in order to gauge the polarity of statements and whole documents. The R packages employed have proved suitable and comprehensive for the required tasks. Improvements in the documentation and the package arrangement are suggested for increasing the usability.

Program Manager, Microsoft

Wednesday June 29, 2016 11:00am - 11:05am PDT

SIEPR 120

Maximum Monte Carlo likelihood estimation of conditional auto-regression models

**Moderators**
## Joseph Rickert

Joseph is a Program Manager at Microsoft having come to Microsoft with the acquisition of Revolution Analytics. He is a data scientist and R language evangelist passionate about analyzing data and teaching people about R. He is a regular contributor to the Revolutions blog and an... Read More →

**Speakers**

The likelihood of conditional auto-regression (CAR) models is expensive to compute even for a moderate data size of around 1000, and it is usually not in closed form with latent variables. In this work we approximate the likelihood by Monte Carlo methods and propose two algorithms for optimising the Monte Carlo likelihood. The algorithms search for the maximum of the Monte Carlo likelihood and, by taking the Monte Carlo error into account, appear to be stable regardless of the initial parameter value. Both algorithms are implemented in R and the iterative procedures are fully automatic with user-specified parameters to control the Monte Carlo simulation and convergence criteria.

We first demonstrate the use of the algorithms on simulated CAR data on a $20 \times 20$ torus. The methods were then applied to data from a forest restoration experiment with around 7000 trees arranged in transects in study plots. The growth rate of trees was modelled by a linear mixed effect model with CAR spatial error and CAR random effects. An approximation to the MLE was found by our proposed algorithms in a reasonable computational time.

Program Manager, Microsoft

Wednesday June 29, 2016 11:05am - 11:10am PDT

SIEPR 120


An embedded domain-specific language for ODE-based drug-disease modeling and simulation

**Moderators**
## Susan Holmes

I like teaching nonparametric multivariate analyses to biologists.
Reproducible research is really important to me and I make all my work available online, mostly as Rmd files. I still like to code, use Github and shiny as well as Bioconductor. I am trying to finish a book for biologists...

**Speakers**

We present a domain-specific mini-language embedded in R for expressing pharmacometrics and drug-disease models. Key concepts in our RxODE implementation include:
* A simple syntax for specifying models in terms of ordinary differential equations (ODE).
* A compilation manager to translate, compile, and load machine code into R for fast execution.
* An 'eventTable' closure object to express inputs/perturbations into the underlying dynamic system being modeled.
* Model reflectance to describe a model's structure, parameters, and derived quantities (useful for meta-programming, e.g., automatic generation of shiny applications).

We present examples in the design of complex drug dosing regimens for first-in-human studies via simulations, the modeling of unabated Alzheimer disease progression, and time-permitting, the modeling of visual acuity among age-related macular degeneration patients in the presence of disease-mitigating therapies. We also compare our approach in RxODE to similar work, namely, deSolve, mrgsolve, mlxR, nlmeODE, and PKPDsim. In closing, we use this 40th anniversary of the S language to reflect on the remarkably solid modeling framework laid out almost 25 years ago in "Statistical Models in S" (Chambers and Hastie (1992)), and to identify new challenges for specifying and fitting increasingly more complex statistical models, such as models of dynamic systems (as above), models for multi-state event history analysis, Bayesian data analysis, etc.

Professor, Statistics, Stanford

Wednesday June 29, 2016 11:06am - 11:24am PDT

Lane & Lyons & Lodato


Calculation and economic evaluation of acceptance sampling plans

**Moderators**

**Speakers**

Sampling inspection is one of the quality control tools used in industry to help keep the quality of products at a satisfactory level while keeping costs under control. When using acceptance sampling inspection, the decision on whether a lot of items is to be accepted or rejected is based on the results of inspecting a sample of items from the lot. Acceptance sampling plans which minimize the mean inspection cost per lot of the process average quality when the remainder of rejected lots is inspected were originally designed by Dodge and Romig for inspection by attributes. Sampling plans for inspection by variables were then proposed, and it has been shown that such plans may be more economical than the corresponding attributes sampling plans. We recall the calculation and economic performance evaluation of the variables sampling plans, show how further improvements in inspection cost could be achieved using an EWMA-based statistic, and comment on some of the possibilities available for calculation and evaluation of the plans in the R extension package LTPDvar.

Wednesday June 29, 2016 11:06am - 11:24am PDT

Barnes & McDowell & Cranston


Efficient tabular data ingestion and manipulation with MonetDBLite

**Moderators**

**Speakers**

We present "MonetDBLite", a new R package containing an embedded version of MonetDB. MonetDB is a free and open source relational database focused on analytical applications. MonetDBLite provides fast complex query answers and unprecedented speeds for data availability and data transfer to and from R. MonetDBLite greatly simplifies database installation, setup and maintenance. It is installed like any R package, and the database fully runs inside the R process. This has the crucial advantage of data transfers between the database and R being very fast. Another advantage is MonetDBLite's fast startup with existing data sets. MonetDBLite will store tables as files on disk, and can reload from these regardless of their size. This enables R scripts to very quickly start processing data instead of loading from, e.g., a CSV file every time. MonetDBLite leverages our previous work on mapping database operations into R (now achieved through dplyr in the MonetDB.R package) as well as previous work on ad-hoc user defined functions for MonetDB with R. The talk will introduce the package, demonstrate its installation, and showcase a real-world statistical data analysis on the Home Mortgage Disclosure Act (HMDA) dataset. We show how MonetDBLite compares with its (partial) namesake SQLite and other relational databases. We will demonstrate that for statistical analysis workloads, MonetDBLite easily outperforms these previous systems, effectively allowing analysis of larger datasets on desktop hardware. MonetDBLite has been submitted to CRAN and will hopefully be accepted by useR! 2016.

Wednesday June 29, 2016 11:06am - 11:24am PDT

SIEPR 130


On the emergence of R as a platform for emergency outbreak response

**Moderators**
## Karthik Ram

Karthik Ram is a co-founder of rOpenSci, and a data science fellow at the University of California's Berkeley Institute for Data Science. Karthik primarily works on a project that develops R-based tools to facilitate open science and access to open data.

**Speakers**
## Thibaut Jombart

The recent Ebola virus disease outbreak in West Africa has been a terrible reminder of the necessity of rapid evaluation of, and response to, emerging infectious disease threats. For such a response to be fully informed, complex epidemiological data including dates of symptom onset, locations of the cases, hospitalisation, contact tracing information and pathogen genome sequences have to be analysed in near real time. Integrating all these data to inform Public Health response is a challenging task, which typically involves a variety of visualisation tools and statistical approaches. However, a unified platform for outbreak analysis has been lacking so far. Some recent collaborative efforts, including several international hackathons, have been made to address this issue. This talk will provide an overview of the current state of R as a platform for the analysis of disease outbreaks, with an emphasis on lessons learnt from a direct involvement with the recent Ebola outbreak response.

co-founder, rOpenSci

Lecturer, Imperial College London

About me
-------------
Biostatistician working on disease outbreak response, pathogen population genetics, phylogenetics, and some random other things; R developer of adegenet, adephylo, apex, bmmix, dibbler, geoGraph, outbreaker(2), treescape, vimes, and a few more shameful projects...

Wednesday June 29, 2016 11:06am - 11:24am PDT

McCaw Hall


Resource-Aware Scheduling Strategies for Parallel Machine Learning R Programs through RAMBO

**Moderators**

**Speakers**

We present resource-aware scheduling strategies for parallel R programs that lead to efficient utilization of parallel computer architectures by estimating resource demands. We concentrate on applications that consist of independent tasks. The R programming language is increasingly used to process large data sets in parallel, which requires a large amount of resources. One important application is parameter tuning of machine learning algorithms, where evaluations need to be executed in parallel to reduce runtime. Here, the resource demands of tasks vary heavily depending on the algorithm configuration. Running such an application in a naive parallel way leads to inefficient resource utilization and thus to long runtimes. Therefore, the R package “parallel” offers a scheduling strategy called “load balancing”. It dynamically allocates tasks to worker processes. This option is recommended when tasks have widely different computation times or if computer architectures are heterogeneous. We analyzed memory and CPU utilization of parallel applications with our TraceR profiling tool and found that the load balancing mechanism is not sufficient for parallel tasks with high variance in resource demands. A scheduling strategy needs to know the resource demands of a task before execution to efficiently map applications to available resources. Therefore, we build a regression model to estimate resource demands based on previously evaluated tasks. Resource estimates such as runtime are then used to guide our scheduling strategies. Those strategies are integrated in our RAMBO (Resource-Aware Model-Based Optimization) framework. Compared to the standard mechanisms of the parallel package, our approach yields improved resource utilization.

Wednesday June 29, 2016 11:06am - 11:24am PDT

Econ 140


Scaling R for Business Analytics

**Moderators**
## Joseph Rickert

Joseph is a Program Manager at Microsoft having come to Microsoft with the acquisition of Revolution Analytics. He is a data scientist and R language evangelist passionate about analyzing data and teaching people about R. He is a regular contributor to the Revolutions blog and an...

**Speakers**

There’s no question that R is the fastest growing analytic language amongst data miners and data scientists. Organizations are also embracing R for business analytics to attract the new generation of analytic talent entering the industry. However, the data and processing limitations associated with R become a real challenge as analysts wrestle with billions of records and analyze complex relationships, while working with new data sources to enhance business analytic solutions. Vendors are addressing this challenge with parallel R technology and claim to “lift all limitations of R”, but no data platform will “auto-magically” scale R. This session drills into the different ways to scale R and their benefits and challenges. The take-away from this session is a set of questions that can be used to evaluate scalable R technologies and align them with your business requirements.

Program Manager, Microsoft

Wednesday June 29, 2016 11:10am - 11:15am PDT

SIEPR 120


Building a High Availability REST API Engine for R

**Moderators**
## Joseph Rickert

Joseph is a Program Manager at Microsoft having come to Microsoft with the acquisition of Revolution Analytics. He is a data scientist and R language evangelist passionate about analyzing data and teaching people about R. He is a regular contributor to the Revolutions blog and an...

**Speakers**

Modern businesses require APIs that have rock solid uptime, where deploying a new version never drops a request, where you can promote and roll back versions, and that perform with low latency and high throughput. Domino has built our R API endpoint functionality leveraging open source tools such as nginx and trestletech’s plumber package, to support modern data science teams’ desire to reduce the time from modeling to productionalization. In this talk, we discuss lessons we have learned building this functionality using the R ecosystem. We describe some of the technical challenges of building such a platform, and some best practices for researchers who want to make their R models easily deployable as APIs. Domino’s technology has served millions of requests for clients ranging from online media to energy companies. We will tell you how we did it.

Program Manager, Microsoft

Wednesday June 29, 2016 11:15am - 11:20am PDT

SIEPR 120


Automated clinical research tracking and assessment using R-Shiny

**Moderators**
## Joseph Rickert

Joseph is a Program Manager at Microsoft having come to Microsoft with the acquisition of Revolution Analytics. He is a data scientist and R language evangelist passionate about analyzing data and teaching people about R. He is a regular contributor to the Revolutions blog and an...

**Speakers**

Many database tools enjoying widespread use in academic medicine, such as REDCap, provide only limited facilities for study monitoring built-in. They often do, however, provide the ability for analysts to interact with data and metadata via Application Program Interfaces (APIs). This offers the possibility of automated monitoring and web-based reporting via the use of external tools and real-time interaction with the API.

In this presentation we provide a framework for automated data quality and participant enrollment monitoring in clinical research using R Shiny Server. To illustrate, we demonstrate linkage of R Shiny to a REDCap database via its API, streamlining the process of data collection, fetching, manipulation and display in one process. Automated display of to-the-minute enrollment, protocol adherence, and descriptions of enrolled samples, are provided in a format consistent with NIH reporting requirements. Additionally, we demonstrate the flexibility of this system in providing interactivity to address investigator queries in real-time; for instance, investigators may select sample subgroups to display, or inquire as to the risk profile of an individual with an extreme baseline measurement on an important screening variable. This approach offers the potential to greatly increase efficiency and maintenance of data quality in research studies.

Program Manager, Microsoft

Wednesday June 29, 2016 11:20am - 11:25am PDT

SIEPR 120


Efficient in-memory non-equi joins using data.table

A join operation combines two (or more) tables on some shared columns based on a condition. An equi-join is a case where this combination condition is defined by the binary operator $==$. It is a special type of $\theta$-join which consists of the entire set of binary operators: {<, <=, ==, >=, >}. This talk presents the recent developments in the data.table package to extend its equi-join functionality to any/all of these binary operators very efficiently. For example, X[Y, on = .(X.a >= Y.a, X.b < Y.a)] performs a range join. Many databases are fully capable of performing both equi and non-equi joins. R/Bioconductor packages IRanges and GenomicRanges contain efficient implementations for dealing with interval ranges alone. However, so far, there are no direct in-memory R implementations of non-equi joins that we are aware of. We believe this is an extremely useful feature that a lot of R users can benefit from.
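
To make the idea of a non-equi join concrete, here is a minimal, language-neutral sketch in Python. It is not the data.table implementation (which is what the talk is about and which is far more efficient); it simply pairs rows with a nested loop. The helper `non_equi_join`, the toy tables `X` and `Y`, and the particular range condition are assumptions chosen for illustration.

```python
def non_equi_join(left, right, condition):
    """Return every (left_row, right_row) pair for which condition holds.

    Naive O(len(left) * len(right)) scan, for illustration only.
    """
    return [(l, r) for r in right for l in left if condition(l, r)]


# Hypothetical toy tables: each row is a dict.
X = [
    {"id": "x1", "a": 10, "b": 2},
    {"id": "x2", "a": 5, "b": 4},
    {"id": "x3", "a": 3, "b": 1},
]
Y = [
    {"id": "y1", "a": 3},
    {"id": "y2", "a": 5},
]

# Range condition: keep X rows whose interval (b, a] contains Y's value a.
pairs = non_equi_join(X, Y, lambda x, y: x["a"] >= y["a"] and x["b"] < y["a"])
matches = [(x["id"], y["id"]) for x, y in pairs]
print(matches)
```

An equi-join is the special case where the condition is a plain equality test, for example `lambda x, y: x["a"] == y["a"]`.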

Wednesday June 29, 2016 11:24am - 11:42am PDT

SIEPR 130


Grid Computing in R with Easy Scalability

**Moderators**

**Speakers**

Parallel computing is useful for speeding up computing tasks and many R packages exist to aid in using parallel computing. Unfortunately it is not always trivial to parallelize jobs and can take a significant amount of time to accomplish, time that may be unavailable. My presentation will demonstrate an alternative method that allows for processing of multiple jobs simultaneously across any number of servers using Redis message queues. This method has proven very useful since I began implementing it at my company over two years ago. In this method, a main Redis server handles communication with any number of R processes on any number of servers. These processes, known as workers, inform the server that they are available for processing and then wait indefinitely until the server passes them a task. In this presentation, it will be demonstrated how trivial it is to scale up or down by adding or removing workers. This will be demonstrated with sample jobs run on workers in the Amazon cloud. Additionally, this presentation will show you how to implement such a system yourself with the rminions package I have been developing. This package is based on what I have learned over the past couple of years and contains functionality to easily start workers, queue jobs, and even perform R-level maintenance (such as installing packages) on all connected servers simultaneously!
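
The worker pattern described above can be sketched in a few lines. This is only an illustration under stated assumptions: Python's standard-library `queue.Queue` and threads stand in for the Redis message queue and the remote R processes, and the names `worker`, `run_jobs` and `square` are hypothetical. It is not the rminions API.

```python
import queue
import threading


def worker(jobs, results):
    """Wait for jobs, run each one, and stop when the None sentinel arrives."""
    while True:
        job = jobs.get()  # blocks until a job is available
        if job is None:
            break
        job_id, func, arg = job
        results.put((job_id, func(arg)))


def run_jobs(tasks, n_workers):
    """Run (func, arg) tasks on n_workers workers and return results in task order."""
    jobs, results = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    for job_id, (func, arg) in enumerate(tasks):
        jobs.put((job_id, func, arg))
    for _ in workers:
        jobs.put(None)  # one stop signal per worker
    for w in workers:
        w.join()
    collected = {}
    while not results.empty():
        job_id, value = results.get()
        collected[job_id] = value
    return [collected[i] for i in range(len(tasks))]


def square(x):
    return x * x


squares = run_jobs([(square, n) for n in range(5)], n_workers=3)
print(squares)
```

Scaling up or down only changes `n_workers`; the code that queues the jobs stays the same, which is the property the abstract emphasizes.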

Wednesday June 29, 2016 11:24am - 11:42am PDT

Econ 140


Heatmaps in R: Overview and best practices

**Moderators**

**Speakers**

A heatmap is a popular graphical method for visualizing high-dimensional data, in which a table of numbers is encoded as a grid of colored cells. The rows and columns of the matrix are ordered to highlight patterns and are often accompanied by dendrograms. Heatmaps are used in many fields for visualizing observations, correlations, missing-value patterns, and more.

This talk will provide an overview of R functions and packages for creating useful and beautiful heatmaps. Attention will be given to data pre-processing, choosing colors for the data-matrix via {viridis}, producing thoughtful dendrograms using {dendextend} and {colorspace}, while ordering the rows and columns with {DendSer} (and {seriation}). The talk will cover both static as well as the newly available interactive plotting engines using packages such as {gplots}, {d3heatmap}, {ggplot2} and {plotly}.

The speaker is the author of the {dendextend} R package, a co-author of the {d3heatmap} package, and blogs at www.r-statistics.com.


Wednesday June 29, 2016 11:24am - 11:42am PDT

Barnes & McDowell & Cranston


Network Diffusion of Innovations in R: Introducing netdiffuseR

**Moderators**
## Karthik Ram

Karthik Ram is a co-founder of rOpenSci, and a data science fellow at the University of California's Berkeley Institute for Data Science. Karthik primarily works on a project that develops R-based tools to facilitate open science and access to open data.

**Speakers**

The Diffusion of Innovations theory, while one of the oldest social science theories, has ebbed and flowed in popularity over its 100 year or so history. In contrast to contagion models, diffusion of innovations can be more complex, since adopting an innovation usually requires more than simple exposure to other users. At the same time, computational tools for data collection, analysis, and network research have advanced considerably, with little parallel development of diffusion network models. To address this gap, we have created the netdiffuseR R package. The netdiffuseR package implements both classical and novel diffusion of innovations models, visualization methods, and data-management tools for the statistical analysis of network diffusion data. The netdiffuseR package goes further by allowing researchers to analyze relatively large datasets in a fast and reliable way, extending current network analysis methods for studying diffusion, thus serving as a great complement to other popular network analysis tools such as igraph, statnet or RSiena. netdiffuseR can be used with new empirical data, with simulated data, or with existing empirical diffusion network datasets.

co-founder, rOpenSci

Wednesday June 29, 2016 11:24am - 11:42am PDT

McCaw Hall


The phangorn package: estimating and comparing phylogenetic trees

**Moderators**
## Susan Holmes

I like teaching nonparametric multivariate analyses to biologists.
Reproducible research is really important to me and I make all my work available online, mostly as Rmd files. I still like to code, use Github and shiny as well as Bioconductor. I am trying to finish a book for biologists...

**Speakers**

Methods of phylogenetic reconstruction are nowadays frequently used outside computational biology, for example in linguistics and, in the form of hierarchical clustering, in many other disciplines. The R package phangorn allows users to reconstruct phylogenies using Maximum Likelihood, Maximum Parsimony or distance-based methods. The package offers many functions to compare trees through visualization (splits networks, lento plot, densiTree) and to choose and compare statistical models (e.g. modelTest, SH-test, parametric bootstrap). phangorn is closely connected with other phylogenetic R packages, such as ape or phytools, in the field of phylogenetic (comparative) methods.

Professor, Statistics, Stanford

Wednesday June 29, 2016 11:24am - 11:42am PDT

Lane & Lyons & Lodato


FirebrowseR an 'API' Client for Broads 'Firehose' Pipeline

**Moderators**
## Joseph Rickert

Joseph is a Program Manager at Microsoft having come to Microsoft with the acquisition of Revolution Analytics. He is a data scientist and R language evangelist passionate about analyzing data and teaching people about R. He is a regular contributor to the Revolutions blog and an...

**Speakers**

The Cancer Genome Atlas is one of the most valuable resources for modern cancer research. One of the major projects for processing and analysing its data is the Firehose Pipeline, provided by the Broad Institute. The pre-processed and analysed data of this pipeline are made available through the Firebrowse website (http://firebrowse.org/) and a RESTful API for downloading such data sets. FirebrowseR is an R client to connect and interact with this API, and to directly download and import the requested data sets into R. Using FirebrowseR, only the requested data sets are downloaded from the API, reducing the download overhead compared to the classic download options, such as CSF, MAF or compressed files. Furthermore, FirebrowseR encapsulates the provided data sets in a standardised format (a data.frame or JSON object), making the steps of data wrangling and importing unnecessary.

Program Manager, Microsoft

Wednesday June 29, 2016 11:25am - 11:30am PDT

SIEPR 120


Performance Above Random Expectation: A more intuitive and versatile metric for evaluating probabilistic classifiers

**Moderators**
## Joseph Rickert

Joseph is a Program Manager at Microsoft having come to Microsoft with the acquisition of Revolution Analytics. He is a data scientist and R language evangelist passionate about analyzing data and teaching people about R. He is a regular contributor to the Revolutions blog and an...

**Speakers**
## Stephen R Piccolo

Many classification algorithms generate probabilistic estimates of whether a given sample belongs to a given class. Various scoring metrics have been developed to assess the quality of such probabilistic estimates. In many domains, the area under the receiver-operating-characteristic curve (AUC) is predominantly used. When applied to two-class problems, the AUC can be interpreted as the frequency at which two randomly selected samples are ranked correctly, according to their assigned probabilities. As its name implies, the AUC is derived from receiver-operating-characteristic (ROC) curves, which illustrate the relationship between the true positive rate and false positive rate. However, ROC curves—which have their roots in signal processing—are difficult for many people to interpret. For example, in medical settings, ROC curves can identify the probability threshold that achieves an optimal balance between over- and under-diagnosis for a particular disease; yet it is unintuitive to evaluate such thresholds visually. I have developed a scoring approach, Performance Above Random Expectation (PARE), which assesses classification accuracy at various probability thresholds and compares it against the accuracy obtained with random class labels. Across all thresholds, this information can be summarized as a metric that evaluates probabilistic classifiers in a way that is qualitatively equivalent to the AUC metric. However, because the PARE method uses classification accuracy as its core metric, it is more intuitively interpretable. It can also be used to visually identify a probability threshold that maximizes accuracy—thus effectively balancing true positives with false positives. This method generalizes to various other applications.
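
The pairwise interpretation of the AUC mentioned above can be checked with a short sketch. It assumes the usual reading of that statement: the two samples are one positive and one negative, a pair counts as correctly ranked when the positive receives the higher probability, and ties count as half. The function name `pairwise_auc` and the toy data are hypothetical; the PARE metric itself is not reproduced here, since the abstract does not give its formula.

```python
from itertools import product


def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs that are ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical class labels and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
auc = pairwise_auc(labels, scores)
print(round(auc, 3))
```

In this toy example eight of the nine positive/negative pairs are ordered correctly; the only miss is the positive scored 0.4 against the negative scored 0.7.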

Program Manager, Microsoft

Assistant Professor, Brigham Young University

Bioinformatics, machine learning, genomics, human health

Wednesday June 29, 2016 11:30am - 11:35am PDT

SIEPR 120


R's Role in Healthcare Data: Exploration, Visualization and Presentation

**Moderators**
## Joseph Rickert

Joseph is a Program Manager at Microsoft having come to Microsoft with the acquisition of Revolution Analytics. He is a data scientist and R language evangelist passionate about analyzing data and teaching people about R. He is a regular contributor to the Revolutions blog and an...

**Speakers**

Over the past six or seven years, the healthcare industry has been transformed through a broad shift away from paper-based workflows to electronic ones. This new electronic infrastructure has enabled the collection of vast amounts of clinical and financial data. Problematically, the ability to analyze such data and convert it into meaningful insights capable of driving continuous improvements has not kept pace with the speed of data acquisition.

With healthcare costs continuing to grow, national policy changes have recently placed a new mandate on healthcare providers to leverage data to rein in expenses. R, which has been honed and developed across a number of industries prior to healthcare, has proven to be an invaluable tool in addressing this challenge. The integrated data manipulation capabilities, focus on exploratory data analysis and deep integration with broader data analysis pipelines, including final end-user analytics and visualizations, have proven invaluable.

This presentation will highlight the growing role of R in healthcare analytics. Specifically, it will focus on:
* R's role in rapid data exploration, facilitated through R-Markdown, particularly around data sets where the exact questions are still being formulated
* R's role in end-user-ready visualizations, with a focus on the ggplot package
* R's role in crafting an explanatory narrative around investigatory analysis

Program Manager, Microsoft

Wednesday June 29, 2016 11:35am - 11:40am PDT

SIEPR 120


Chunked, dplyr for large text files

**Moderators**
## Joseph Rickert

Joseph is a Program Manager at Microsoft having come to Microsoft with the acquisition of Revolution Analytics. He is a data scientist and R language evangelist passionate about analyzing data and teaching people about R. He is a regular contributor to the Revolutions blog and an...

**Speakers**

During a data analysis project it may happen that a new version of the raw data becomes available or that data changes are made outside of your control. `daff` is an R package that helps to keep track of such changes. It can find differences in values between data.frames, store these differences, render them and apply them as a patch to a new data.frame. It can also merge two versions of a data.frame that have a common parent version. It wraps the daff.js library of Paul Fitzpatrick (http://github.com/paulfitz/daff) using the V8 package.

Program Manager, Microsoft

Wednesday June 29, 2016 11:40am - 11:45am PDT

SIEPR 120


Classifying Murderers in Imbalanced Data Using randomForest

**Moderators**
## Karthik Ram

Karthik Ram is a co-founder of rOpenSci, and a data science fellow at the University of California's Berkeley Institute for Data Science. Karthik primarily works on a project that develops R-based tools to facilitate open science and access to open data.

**Speakers**
## Jorge Alberto Miranda

In order to allocate resources more effectively with the goal of providing safer communities, R's randomForest algorithm was used to identify candidates who may commit or attempt murder. While crime data within the general population may be highly imbalanced, one may expect the rate of murderers within a high-risk probationer population to be much less imbalanced. However, the County of Los Angeles had nearly 130 probationers commit or attempt murder out of nearly 17,000 (a ratio close to 1:130). Classic methods were used to overcome class imbalance, including under/over stratified sampling and variable sampling per tree. The results were encouraging. Model validation tests demonstrate an 87% overall accuracy rate at relatively low costs. The agency currently uses a risk assessment tool that was outperformed by randomForest by up to 52% (both in overall accuracy and in a reduction in false positives). This work is based on research conducted by Berk, R. et al. (2009), originally published in the Journal of the Royal Statistical Society.
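
One of the classic remedies mentioned above, under-sampling the majority class, can be sketched as follows. This is a generic illustration in Python rather than the speaker's randomForest setup: the helper `undersample`, the field names, and the synthetic rows (sized to echo the rough counts quoted in the abstract) are all assumptions.

```python
import random


def undersample(rows, label_key, seed=0):
    """Randomly shrink every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced


# Synthetic data only: 130 positive cases among 17,000 rows.
rows = [{"id": i, "label": 1 if i < 130 else 0} for i in range(17000)]
balanced = undersample(rows, "label")
print(len(balanced), sum(r["label"] for r in balanced))
```

Under-sampling discards most majority-class rows, which is one reason ensemble methods often draw a fresh balanced sample for each tree instead of balancing the data once.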

co-founder, rOpenSci

Analyst, County of Los Angeles

I have been an R user since 2013 when I first started working with a data reporting team at the Los Angeles County Probation Department. One of my goals in life is to convert more of my colleagues into R users and make R part of the County toolkit. With nearly 100,000 employees, I...

Wednesday June 29, 2016 11:42am - 12:00pm PDT

McCaw Hall


Exploring the R / SQL boundary

**Moderators**

**Speakers**

Databases have a long history of delivering highly scalable solutions for storing, manipulating, and analyzing data, transaction processing and data warehousing, while R is the most widely used language for data analytics and machine learning due to its rich ecosystem of machine learning algorithms and data manipulation capabilities. But, when using these tools together, how do you decide how much processing to do in SQL before switching to R? In this talk, we will explore setting the R / SQL boundary under three scenarios: RODBC connections, dplyr data extractions, and in-database R processing, and examine the consequences of each of these approaches with respect to data exploration, feature engineering, modeling and predictions. We identify common performance killers such as excessive data movements and serial processing, and illustrate the techniques, with examples from both an open source database (Postgres) and a commercial database (Microsoft SQL Server).

Wednesday June 29, 2016 11:42am - 12:00pm PDT

SIEPR 130


Interactive Terabytes with pbdR

Historically, large scale computing and interactivity have been at odds. A new series of packages has recently been developed to attempt to rectify this problem. We do so by combining two programming models: client/server (CS) and single program multiple data (SPMD). The client/server model allows the R programmer to control from one to thousands of batch servers running as cooperating remote instances of R. This can easily be done from a local R or RStudio session. The communication is handled by the well-known ZeroMQ library, with a new set of package bindings available to R by way of the pbdZMQ package. The client and server are implemented in the new remoter and pbdCS packages. To handle computations, we use the established pbdR packages for large scale distributed computing. These packages utilize HPC standards like MPI and ScaLAPACK to handle complex coupled computations on truly large data. These tools use the batch SPMD programming model, and constitute the server portion of the client/server hierarchy. So once the client issues a command, it is transmitted to the SPMD servers and executed in a massively parallel fashion. This talk will discuss the package components and provide timing results for some Terabyte size computations running on hundreds of cores of a cluster.

Wednesday June 29, 2016 11:42am - 12:00pm PDT

Econ 140


Multivoxel Pattern Analysis of fMRI Data

**Moderators**
## Susan Holmes

I like teaching nonparametric multivariate analyses to biologists.
Reproducible research is really important to me and I make all my work available online, mostly as Rmd files. I still like to code, use Github and shiny as well as Bioconductor. I am trying to finish a book for biologists...

**Speakers**

Analysis of functional magnetic resonance imaging (fMRI) data has traditionally been carried out by analyzing each voxel's time-series independently with a linear model. While this approach has been effective for creating statistical maps of brain activity, recent work has shown that greater sensitivity to distributed neural signals can be achieved with multivariate approaches that analyze patterns of activity rather than methods that work on only one voxel at a time. This has led to an explosion of interest in so-called "multivoxel pattern analysis" (MVPA), which is essentially the application of machine learning algorithms to neuroimaging data. The R programming environment is well-suited for MVPA analyses due to its large and varied support for statistical learning methods available on CRAN. Many of these methods can be conveniently accessed using a standard interface provided by the 'caret' library. Here we present a new library (rMVPA) that makes MVPA analyses of fMRI data available to R users by leveraging the 'caret' and 'neuroim' packages. The rMVPA package implements multiple methods for multivariate analysis of fMRI data including the spherical searchlight method, region of interest analyses, and a new hierarchical ensemble approach to MVPA.

Professor, Statistics, Stanford

Wednesday June 29, 2016 11:42am - 12:00pm PDT

Lane & Lyons & Lodato

Lane & Lyons & Lodato

Phylogenetically informed analysis of microbiome data using adaptive gPCA in R

**Moderators**

**Speakers**

When analyzing microbiome data, biologists often use exploratory methods that take into account the relatedness of the bacterial species present in the data. This helps the interpretability and stability of the analysis because phylogenetically related bacteria often have similar functions. However, we believe (and will demonstrate) that the methods currently in use put too much emphasis on the phylogeny when making the ordinations. To address this, we have developed a framework we call adaptive gPCA, which allows the user to specify the amount of weight given to the tree and which can also select that weight automatically. We have implemented this method in R and have made it easy to use with phyloseq, a popular R package for microbiome data storage and manipulation. Additionally, we have developed a shiny app that allows for interactive data visualization and comparison of the ordinations resulting from different weightings of the tree.

Wednesday June 29, 2016 11:42am - 12:00pm PDT

Barnes & McDowell & Cranston

Barnes & McDowell & Cranston

Automated risk calculation in clinical practice and research - the riskscorer package

**Moderators**
## Joseph Rickert

Joseph is a Program Manager at Microsoft having come to Microsoft with the acquisition of Revolution Analytics. He is a data scientist and R language evangelist passionate about analyzing data and teaching people about R. He is a regular contributor to the Revolutions blog and an...

**Speakers**
## Alexander Meyer

Clinical risk scores are important tools in therapeutic decision making as well as for analysis and adjustments in clinical research. Often risk scores are published without an easily accessible interface for calculation. And if tools exist, these are mostly web-based user interfaces and therefore not suitable for either batch processing in research or integration into the hospital's clinical information system infrastructure.

We developed the _riskscorer_ package for easy and automatic clinical risk score calculation with the following features in mind:

- simple programming interface
- extensibility
- flexible handling of differing data codings
- individual patient risk calculation as well as the possibility of batch processing
- an HTTP web-service interface based on the plumber (https://github.com/trestletech/plumber) package for easy integration into an existing clinical information system infrastructure

Currently three surgical risk scores are implemented: STS score (http://riskcalc.sts.org/), EuroScore I and EuroScore II (http://www.euroscore.org/). It is already used in our research and integration into our clinical information system is planned. The riskscorer package is under continuous development and we have released the source code under the MIT license on the GitHub platform (https://github.com/meyera/riskscorer).

The integration of automated risk score calculation into the clinical workflow and into reproducible and efficient data analysis pipelines in research has the potential to improve patient outcomes.

Program Manager, Microsoft

Physician Scientist, Cardiac Surgeon, German Heart Institute Berlin

Big Data Healthcare Analytics
Deep Learning, Machine Learning
Graph Databases and R
NLP
Shiny, purrr, dplyr

Wednesday June 29, 2016 11:45am - 11:50am PDT

SIEPR 120

SIEPR 120

A spatial policy tool for cycling potential in England
**Remote Presentation via Skype**

Utility cycling is an increasingly common objective worldwide. The Propensity to Cycle Tool (PCT) www.pct.bike is a planning support system created using open source software, including R (Shiny) for data processing and Leaflet for interactive visualisation. The project is funded by the UK Department for Transport.

We have developed the sustainable transport planning package (stplanr). Given two points, an origin and a destination (OD), it displays a straight line connecting them. To get a route, it relies on two APIs: GraphHopper and CycleStreets. The GraphHopper API is global, whereas CycleStreets is UK specific. The CycleStreets API incorporates hilliness, giving faster and quieter routes. We have used the mapshaper library to simplify the boundaries of the geographical data (shape files).

A geographically based, multi-layered application has been developed using the Shiny and Leaflet packages. The PCT represents current cycling and cycling potential based on OD data from the England 2011 Census. Cycling potential and the corresponding health and environmental benefits are modelled as a function of route distance, hilliness and other factors at OD and area level. One of the main hurdles was to incorporate complex spatial big data sets and to allow multiple web users to use the tool concurrently. To load, manipulate and interrogate the data, we use innovative on-demand mechanisms to visualise it.

This talk explains the design, build and deployment of the PCT with an emphasis on reproducibility (e.g. creation of the stplanr package for data pre-processing), scalability (solved with the new JavaScript interface package mapshaper) and lessons learned.

**Moderators**
## Edzer Pebesma

**Speakers**

professor, University of Muenster

My research interest is spatial, temporal, and spatiotemporal data in R. I am one of the authors, and maintainer, of sp and sf. You'll find my tutorial material at https://edzer.github.io/UseR2017/ - note that I will update it until shortly before the tutorial.

Wednesday June 29, 2016 1:00pm - 1:18pm PDT

SIEPR 120

SIEPR 120

mumm: An R-package for fitting multiplicative mixed models using the Template Model Builder (TMB)

**Moderators**
## Patrícia Martinková

**Speakers**

Non-linear mixed models of various kinds are fundamental extensions of the linear mixed models commonly used in a wide range of applications. An important example of a non-linear mixed model is the so-called multiplicative mixed model, which we will consider as a model with a linear mixed model part and one or more multiplicative terms. A multiplicative term is here a product of a random effect and a fixed effect, i.e. a term that models a part of the interaction as a random coefficient model based on linear regression on the fixed effect. The multiplicative mixed model can be applied in many different fields for improved statistical inference, e.g. sensory and consumer data, genotype-by-environment data, and data from method comparison studies in medicine. However, the maximum likelihood estimation of the model parameters can be time consuming without proper estimation methods. Using automatic differentiation techniques, the Template Model Builder (TMB) R-package [Kristensen, 2014] fits mixed models through user-specified C++ templates in a very fast manner, making it possible to fit complex models with up to $10^6$ random effects within reasonable time. The mumm R-package uses the TMB package to fit multiplicative mixed models, such that the user avoids the coding of C++ templates. The package provides a function, where the user only has to give a model formula and a data set as input to get the multiplicative model fit together with standard model summaries such as parameter estimates and standard errors as output.
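
As an illustration of the model class described above, a multiplicative mixed model with a single multiplicative term could be written as follows. The notation is an assumption made for this sketch and is not taken from the abstract: $\nu_j$ is a fixed effect (for example a product), $a_i$ is a random effect (for example an assessor), and $b_i$ is a random coefficient that scales the fixed effect.

```latex
y_{ijk} = \mu + a_i + \nu_j + b_i \nu_j + \varepsilon_{ijk},
\qquad
a_i \sim N(0, \sigma_a^2), \quad
b_i \sim N(0, \sigma_b^2), \quad
\varepsilon_{ijk} \sim N(0, \sigma^2)
```

The terms $\mu + a_i + \nu_j + \varepsilon_{ijk}$ form the linear mixed model part, while $b_i \nu_j$ is the product of a random effect and a fixed effect, which is what makes the model non-linear in its parameters.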

Researcher, Institute of Computer Science, Czech Academy of Sciences

Researcher in statistics and psychometrics from Prague. Uses R to boost active learning in classes. Fulbright alumna and 2013-2015 visiting research scholar with Center for Statistics and the Social Sciences and Department of Statistics, University of Washington.

Wednesday June 29, 2016 1:00pm - 1:18pm PDT

Econ 140

Econ 140

R markdown: Lifesaver or death trap?

**Moderators**
## Rasmus Arnling Bååth

**Speakers**

The popularity of R markdown is unquestionable, but will it prove as useful to the blind community as it is for our sighted peers? The short answer is "yes", but the more realistic answer is that it depends on many other aspects, some of which will remain outside the skill sets of many authors. Source R markdown files are plain text files, and are therefore totally accessible for a blind user. The documents generated from these source files for end-users differ in their accessibility; HTML is great and a pdf generated using LaTeX is very limited. International standards exist for ensuring most document formats are accessible, but the TeX community has not yet developed a tool for generating an accessible pdf document from any form of LaTeX source. There is little hope for any pdf containing mathematical expressions or graphical content. In contrast, the HTML documents created from R markdown can contain many aspects of accessibility with little or no additional work required from a document's author. A substantial problem facing any blind author wishing to create an HTML document from their R markdown files is that there is no simple editor available that is accessible; RStudio is not an option that can be used by blind people. Until an alternative tool becomes available, blind people will either have to use cumbersome work-arounds or rely on a small application we have built specifically for editing and processing R markdown documents.

Data Scientist, King

I'm a Data scientist at King interested in all things stats, but if it's Bayesian I'm especially interested.

Wednesday June 29, 2016 1:00pm - 1:18pm PDT

McCaw Hall

McCaw Hall

Reusable R for automation, small area estimation and legacy systems

**Moderators**

**Speakers**

Running a complex model once is easy: just pull up your statistical program of choice, plug in the data and the model, and off you go. The problem comes when you then find yourself trying to scale to running that model with different data hundreds or thousands of times. In order to scale and save analysts from spending all their time running models over and over again, you need automation. You need a well-designed and tested environment. You need well-engineered R. You also need to sell it to analysts. We wanted to use the tools of software engineering and reusable research to allow statisticians and epidemiologists to be more efficient, but statisticians and epidemiologists are not computer scientists and a lot of this world is new to them. So we had to develop not only for good software practice but also to ensure that others could use our tools, even when they come with a very different focus to what analysts might be used to.

Using the example of batch small area estimation using generalized additive models, we will talk about the project, the tools we used and how to integrate R into a legacy SAS environment with a minimum of pain, allowing for uptake of the strengths of R without exposing new users to its complexity.

Wednesday June 29, 2016 1:00pm - 1:18pm PDT

Barnes & McDowell & Cranston

Barnes & McDowell & Cranston

Run-time Testing Using assertive

**Moderators**

**Speakers**

assertive is a group of R packages that lets you check that your code is running as you want it to. assert_* functions test a condition and throw an error if it fails, letting you write robust code more easily. Hundreds of checks are available for types and properties of variables, file and directory properties, numbers, strings, a variety of data types, the state of R, your OS and IDE, and many other conditions. The packages are optimised for easy to read code and easy to understand error messages.
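
The idea of assert_* functions that test a condition and fail with a readable message can be sketched outside R as well. The following is a minimal Python illustration, not the assertive API: the function names are hypothetical stand-ins chosen to echo the assert_* naming style, and `geometric_mean` is an invented example of guarded code.

```python
from numbers import Real


def assert_is_numeric(x, name="x"):
    """Raise a TypeError with a readable message unless x is a real number."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(x, bool) or not isinstance(x, Real):
        raise TypeError(f"{name} is not numeric; it has type {type(x).__name__}.")
    return x


def assert_all_are_positive(values, name="values"):
    """Raise a ValueError naming the offending positions unless all values > 0."""
    bad = [i for i, v in enumerate(values) if not v > 0]
    if bad:
        raise ValueError(f"{name} has non-positive entries at positions {bad}.")
    return values


def geometric_mean(values):
    """Geometric mean, guarded by run-time checks on the input."""
    for i, v in enumerate(values):
        assert_is_numeric(v, name=f"values[{i}]")
    assert_all_are_positive(values)
    product = 1.0
    for v in values:
        product *= v
    return product ** (1.0 / len(values))
```

Calling `geometric_mean([1, -2])` stops immediately with a message that names the bad position, rather than silently returning a meaningless result later in the computation.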

Wednesday June 29, 2016 1:00pm - 1:18pm PDT

SIEPR 130

SIEPR 130

R at Google

I'll discuss

- R at Google: the explosive growth of R, interfaces between R and other parts of the Google computational infrastructure, documentation and support,
- Google contributions to the external community, and
- Google's culture of taking and supporting initiative, and how that fits with R.

Wednesday June 29, 2016 1:00pm - 1:20pm PDT

Lane & Lyons & Lodato

Lane & Lyons & Lodato

A Lap Around R Tools for Visual Studio

**Moderators**
## Rasmus Arnling Bååth

**Speakers**

R Tools for Visual Studio is a new, Open Source and free tool for R Users built on top of the powerful Visual Studio IDE. In this talk, we will take you on a tour of its features and show you how they can help you be a more productive R user. We will look at:

- Integrated debugging support
- Variable/data frame visualization
- Plotting and help integration
- Using the Editor and REPL in concert with each other
- RMarkdown and Shiny integration
- Using Excel and SQL Server
- Extensions and source control

Data Scientist, King

I'm a Data scientist at King interested in all things stats, but if it's Bayesian I'm especially interested.

Wednesday June 29, 2016 1:18pm - 1:36pm PDT

McCaw Hall

McCaw Hall

How to use the archivist package to boost reproducibility of your research

The R package archivist allows you to share and reproduce R objects (artifacts) with other researchers, either through a knitr script, embedded hooks in figure/table captions, a shared folder or GitHub/Bitbucket repositories.

Key functionalities of this package include: (i) management of local and remote repositories which contain R objects and object meta-data (properties of objects and relations between them); (ii) archiving R objects to repositories; (iii) sharing and retrieving objects by their unique hooks; (iv) searching for objects with specific properties / relations to other objects; (v) verification of an object's identity and its context of creation.

The package archivist extends, in combination with packages such as knitr and Sweave, the reproducible research paradigm by creating new ways to retrieve and validate previously calculated objects. These functionalities also result in a variety of opportunities such as: sharing R objects within reports or articles by adding hooks to R objects in table or figure captions; interactive exploration of object repositories; caching function calls; retrieving an object's pedigree along with its session info.
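
The functionalities listed above (archiving, retrieval by hook, search by property, identity verification) can be sketched with a small content-addressed store. This Python sketch is not the archivist API: the `ArtifactRepo` class and its methods are hypothetical, and it assumes that a hook behaves like a hash computed from the stored object.

```python
import hashlib
import pickle


class ArtifactRepo:
    """Toy in-memory repository keyed by a hash of each stored object."""

    def __init__(self):
        self._blobs = {}  # hook -> serialized object
        self._tags = {}   # hook -> set of descriptive tags

    def save(self, obj, tags=()):
        """Archive obj and return its hook (a SHA-256 digest of its bytes)."""
        blob = pickle.dumps(obj)
        hook = hashlib.sha256(blob).hexdigest()
        self._blobs[hook] = blob
        self._tags.setdefault(hook, set()).update(tags)
        return hook

    def load(self, hook):
        """Retrieve an object by hook, verifying its identity first."""
        blob = self._blobs[hook]
        if hashlib.sha256(blob).hexdigest() != hook:
            raise ValueError("stored object does not match its hook")
        return pickle.loads(blob)

    def search(self, tag):
        """Return the hooks of all objects carrying the given tag."""
        return sorted(h for h, t in self._tags.items() if tag in t)


repo = ArtifactRepo()
model = {"coef": [1.5, -0.2], "formula": "y ~ x"}
hook = repo.save(model, tags=["class:model", "author:demo"])
restored = repo.load(hook)
found = repo.search("class:model")
```

Because the hook is derived from the object's content, quoting it in a caption is enough for a reader with access to the repository to fetch exactly the object that produced a figure or table.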

Wednesday June 29, 2016 1:18pm - 1:36pm PDT

Barnes & McDowell & Cranston

Barnes & McDowell & Cranston

SpatialProbit for fast and accurate spatial probit estimations

**Moderators**
## Edzer Pebesma

My research interest is spatial, temporal, and spatiotemporal data in R. I am one of the authors, and maintainer, of sp and sf. You'll find my tutorial material at https://edzer.github.io/UseR2017/ - note that I will update it until shortly before the tutorial.

**Speakers**

This package meets the emerging need for powerful and reliable models for the analysis of spatial discrete choice data. Since the explosion of available and voluminous geospatial and location data, existing estimation techniques cannot withstand the curse of dimensionality and are restricted to samples with fewer than a few thousand observations.

The functions contained in SpatialProbit allow fast and accurate estimations of Spatial Autoregressive and Spatial Error Models under Probit specification. They are based on the full maximization of likelihood of an approximate multivariate normal distribution function, a task that was considered as prodigious just seven years ago (Wang et al. 2009). Extensive simulation and empirical studies proved that these functions can readily handle sample sizes with as many as several millions of observations, provided the spatial weight matrix is in convenient sparse form, as is typically the case for large data sets, where each observation neighbours only a few other observations.

SpatialProbit relies amongst others on the Rcpp, RcppEigen and Matrix packages to produce fast computations for large sparse matrices.

Possible applications of spatial binary choice models include spread of diseases and pathogens, plant distribution, technology and innovation adoption, deforestation, land use change, amongst many others.

We will present the results of the SpatialProbit package for a large database on land use change at the plot level.

professor, University of Muenster

Wednesday June 29, 2016 1:18pm - 1:36pm PDT

SIEPR 120

SIEPR 120

Tools for Robust R Packages

**Moderators**

**Speakers**

Building an R package is a great way of encapsulating code, documentation and data in a single testable and easily distributable unit. At Mango we are building R packages regularly, and have been developing tools that ease this process and also ensure a high quality, maintainable software product. I will talk about some of them in this presentation. Our goodPractice package gives advice on good package building practices. It finds unsafe functions like sapply and sample; it calculates code complexity measures and draws function call graphs. It also incorporates existing packages for test coverage (covr) and source code linting (lintr). It can be used interactively, or in a continuous integration environment. The argufy package allows writing declarative argument checks and coercions for function arguments. The checking code is generated and included automatically. The progress package allows adding progress bars to loops and loop-like constructs (lapply, etc.) with minimal extra code and minimal runtime overhead. The pkgconfig package provides a configuration mechanism in which configuration settings from one package do not interfere with settings from another package.

Wednesday June 29, 2016 1:18pm - 1:36pm PDT

SIEPR 130

SIEPR 130

Visualizing multifactorial and multi-attribute effect sizes in linear mixed models with a view towards sensometrics

**Moderators**
## Patrícia Martinková

**Speakers**
## Per Bruun Brockhoff

In Brockhoff et al. (2016), the close link between Cohen's d, the effect size in an ANOVA framework, and the so-called Thurstonian (Signal detection) d-prime was used to suggest better visualizations and interpretations of standard sensory and consumer data mixed model ANOVA results. The basic and straightforward idea is to interpret effects relative to the residual error and to choose the proper effect size measure. For multi-attribute bar plots of F-statistics this amounts, in balanced settings, to a simple transformation of the bar heights so that they depict what can be seen as approximately the average pairwise d-primes between products. For extensions of such multi-attribute bar plots into more complex models, similar transformations are suggested and become more important, as the transformation depends on the number of observations within factor levels and hence makes bar heights more comparable across factors with different numbers of levels. For mixed models, where in general the relevant error terms for the fixed effects are not the pure residual error, it is suggested to base the d-prime-like interpretation on the residual error. The methods are illustrated on a multifactorial sensory profile data set and compared to actual d-prime calculations based on ordinal regression modelling through the ordinal package. A generic "plug-in" implementation of the method is given in the SensMixed package, which in turn depends on the lmerTest package. We discuss and clarify the bias mechanisms inherently challenging effect size measure estimates in ANOVA settings.

Researcher, Institute of Computer Science, Czech Academy of Sciences

Researcher in statistics and psychometrics from Prague. Uses R to boost active learning in classes. Fulbright alumna and 2013-2015 visiting research scholar with Center for Statistics and the Social Sciences and Department of Statistics, University of Washington.

Professor, DTU Compute, Danish Technical University

Statistics, Sensometrics, Chemometrics, Pharmacometrics.

Wednesday June 29, 2016 1:18pm - 1:36pm PDT

Econ 140

Econ 140

How Teradata Aster R Scales Data Science

**Moderators**

**Speakers**

One of the key advantages of using R for data mining and machine learning is that one may use the same environment for both data munging and algorithm execution. The problem with R, however, is that R's speed and memory constraints have limited the size of the datasets and the complexity of data.

Teradata's Aster R package was developed to lift these constraints by eliminating the requirement that the data be managed locally and by offering a large suite of multi-genre analytic functions that work via in-database execution. This session will examine a standard big data machine learning analytic workflow and present benchmarks from Teradata Aster R on several typical data cleaning and modeling tasks, showing how Teradata Aster R scales data science.

Wednesday June 29, 2016 1:20pm - 1:30pm PDT

Lane & Lyons & Lodato

Lane & Lyons & Lodato

Statistical Computing with R

**Moderators**

**Speakers**

At Two Sigma, we find meaning in petabyte-scale data, and do so with efficiency. We leverage the breadth and flexibility of the R data analytic environment on top of scalable data ingestion and distributed computing platforms. This approach enables us to conduct rapid prototyping and data exploration, as well as reproducible research. We will also discuss Beaker, a polyglot open source notebook that integrates easily with R and other data-oriented languages.

Wednesday June 29, 2016 1:30pm - 1:40pm PDT

Lane & Lyons & Lodato

Lane & Lyons & Lodato

Adding R, Jupyter and Spark to the toolset for understanding the complex computing systems at CERN's Large Hadron Collider

**Moderators**
## Rasmus Arnling Bååth

I'm a Data scientist at King interested in all things stats, but if it's Bayesian I'm especially interested.

**Speakers**
## Dirk Duellmann

High Energy Physics (HEP) has a decades-long tradition of statistical data analysis and of using large computing infrastructures. CERN's current flagship project LHC has collected over 100 PB of data, which is analysed in a world-wide distributed computing grid by millions of jobs daily. Being a community with several thousand scientists, HEP also has a tradition of developing its own analysis toolset. In this contribution we will briefly outline the core physics analysis tasks and then focus on applying data analysis methods also to understand and optimise the large and distributed computing systems in the CERN computer centre and the world-wide LHC computing grid. We will describe the approach and tools picked for the analysis of metrics about job performance, disk and network I/O and the geographical distribution and access to physics data. We will present the technical and non-technical challenges in optimising a distributed infrastructure for large scale science projects and will summarise the first results obtained.

Data Scientist, King

Analysis & Design - Storage Group, CERN

quantitative understanding of large computing and storage systems

Wednesday June 29, 2016 1:36pm - 1:54pm PDT

McCaw Hall

McCaw Hall

Extending CRAN packages with binaries: x13binary

**Moderators**

**Speakers**

The x13binary package provides pre-built binaries of X-13ARIMA-SEATS, the seasonal adjustment software by the U.S. Census Bureau. X-13 is a well-established tool for de-seasonalization of time series, and used by statistical offices around the world. Other packages such as seasonal can now rely on x13binary without requiring any intervention by the user. Together, these packages bring a very featureful and expressive interface for working with seasonal data to the R environment. Thanks to x13binary, installing seasonal is now as easy as any other CRAN package as it no longer requires a manual download and setup of the corresponding binary. Like the Rblpapi package, x13binary provides interesting new ways of deploying binary software to aid CRAN: A GitHub repository provides the underlying binary in a per-operating system form (see the x13prebuilt repository). The actual CRAN package then uses this repo to download and install the binaries once per installation or upgrade. This talk will detail our approach, summarize our experience in providing binaries via CRAN and GitHub, and discuss possible future directions.

Wednesday June 29, 2016 1:36pm - 1:54pm PDT

SIEPR 130

SIEPR 130

GNU make for reproducible data analysis using R and other statistical software

**Moderators**

**Speakers**
## Peter John Baker

As a statistical consultant, I often find myself repeating similar steps for data analysis projects. These steps follow a pattern of reading, cleaning, summarising, plotting and analysing data then producing a report. This is always an iterative process because many of these steps need to be repeated, especially when quality issues are present or overall goals change. Reproducibility becomes more difficult with increasing complexity.

For very small projects or toy examples, we may be able to do all analysis steps and reporting in a single markdown document. However, to increase efficiency for larger data analysis projects, a modular programming approach can be adopted. Each step in the process is then carried out using separate R syntax or markdown files. GNU Make automates the mundane task of regenerating output given dependencies between syntax, markdown and data files in a project. For instance, if we store results from time consuming analyses and radically change a report, we only need to rerun the R markdown file for reporting. On the other hand, if initial data are changed, we rerun everything. In both cases, we can set up our favourite IDE to use Make and simply press the 'build' button.

To extend Make for R, Rmarkdown, SAS and STATA, I have written pattern rules which are available on github. These are used by adding a single line to the project Makefile. An overall strategy and constructing a simple Makefile for a data analysis project will be briefly outlined and demonstrated.
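
The regeneration logic that Make automates can be sketched in a few lines of Python. This is an illustration of the timestamp rule only, not the author's pattern rules and not a replacement for Make; the file names in the usage note below are hypothetical.

```python
import os


def is_stale(target, dependencies):
    """Return True if target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_time for dep in dependencies)


def build_order(rules, goal):
    """Return the targets to rebuild for goal, dependencies first.

    rules maps each target to the list of files it depends on; files that
    are not keys of rules are treated as source files.
    """
    order = []
    visited = set()

    def visit(target):
        if target in visited or target not in rules:
            return
        visited.add(target)
        for dep in rules[target]:
            visit(dep)
        # Rebuild if out of date on disk, or if a dependency is being rebuilt.
        deps = rules[target]
        if is_stale(target, deps) or any(d in order for d in deps):
            order.append(target)

    visit(goal)
    return order
```

With rules such as `{"clean.csv": ["raw.csv", "clean.R"], "report.html": ["clean.csv", "report.Rmd"]}`, editing only `report.Rmd` schedules just `report.html`, whereas changing `raw.csv` schedules `clean.csv` and then `report.html`, matching the two cases described above.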

Senior Lecturer/Statistician, University of Queensland

Statistical Consultant in Public Health. R user since late 90s. Written several R packages. Teach statistics and R

Wednesday June 29, 2016 1:36pm - 1:54pm PDT

Barnes & McDowell & Cranston

Revisiting the Boston data set (Harrison and Rubinfeld, 1978)

**Moderators**
## Edzer Pebesma

My research interest is spatial, temporal, and spatiotemporal data in R. I am one of the authors, and maintainer, of sp and sf. You'll find my tutorial material at https://edzer.github.io/UseR2017/ - note that I will update it until shortly before the tutorial.

**Speakers**

In the extended topical sphere of Regional Science, more scholars are addressing empirical questions using spatial and spatio-temporal data. An emerging challenge is to alert “new arrivals” to existing bodies of knowledge that can inform the ways in which they structure their work. It is a particular matter of opportunity and concern that most of the data used is secondary. This contribution is a brief review of questions of system articulation and support, illuminated retrospectively by a deconstruction of the Harrison and Rubinfeld (1978) Boston data set and hedonic house value analysis used to elicit willingness to pay for clean air.

professor, University of Muenster

Wednesday June 29, 2016 1:36pm - 1:54pm PDT

SIEPR 120

Simulation and power analysis of generalized linear mixed models

**Moderators**
## Patrícia Martinková

Researcher in statistics and psychometrics from Prague. Uses R to boost active learning in classes. Fulbright alumna and 2013-2015 visiting research scholar with Center for Statistics and the Social Sciences and Department of Statistics, University of Washington.

**Speakers**

As computers have improved, simulation studies that explore the implications of assumption violations and assess statistical power have become more prevalent. The simglm package allows for flexible simulation of general(ized) linear mixed models (multilevel models) under cross-sectional or longitudinal frameworks. In addition, the package allows for different distributional assumptions to be made such as non-normal residuals and random effects, missing data, and serial correlation. A power analysis by simulation can also be conducted by specifying a model to be simulated and the number of replications. This package can be useful for instructors or students for courses involving the general(ized) linear mixed model, as well as researchers looking to conduct simulations exploring the impact of assumption violations. The focus of the presentation will be on showing how to use the package, including live demos of the varying inputs and outputs, with working code. In addition to the syntax, a Shiny application will be made to show how the features can be made accessible to students in the classroom who are unfamiliar with R. The Shiny application will also provide a nice use case for the package, a live vignette of sorts.
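
The idea of a power analysis by simulation can be shown with a deliberately simplified sketch. It does not use simglm and does not fit a mixed model: it assumes a two-group comparison with known unit variance and a two-sided z-test at the 5% level, and the function name `simulate_power` and its parameters are hypothetical. Power is estimated as the share of simulated data sets in which the test rejects the null hypothesis.

```python
import random
import statistics


def simulate_power(effect, n_per_group, n_reps=2000, seed=1):
    """Estimate power as the fraction of replications that reject H0."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    # Standard error of the difference in means when sigma = 1 in both groups.
    se = (2.0 / n_per_group) ** 0.5
    rejections = 0
    for _ in range(n_reps):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, 1.0) for _ in range(n_per_group)]
        z = (statistics.fmean(treated) - statistics.fmean(control)) / se
        if abs(z) > 1.959964:  # two-sided critical value for alpha = 0.05
            rejections += 1
    return rejections / n_reps
```

Calling `simulate_power(0.0, 30)` should return a value near the nominal 5% error rate, while larger effects or sample sizes push the estimate toward 1. The same loop structure carries over to more complex models by replacing the data generation and the test.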

Researcher, Institute of Computer Science, Czech Academy of Sciences

Wednesday June 29, 2016 1:36pm - 1:54pm PDT

Econ 140

Changing lives with Data Science at Microsoft

**Moderators**

**Speakers**
## David Smith

Whether it's called data science, machine learning, or analytics, the combination of new data sources and statistical modeling has produced some truly revolutionary applications. Many of these applications incorporate open source technologies (including R) and research from academic institutions. In this talk, I'll share a few ways that Microsoft is improving the lives of people around the world by applying Statistics, research and open-source software in applications and devices.

Cloud Advocate, Microsoft

Ask me about R at Microsoft, the R Consortium, or the Revolutions blog.

Wednesday June 29, 2016 1:40pm - 1:55pm PDT

Lane & Lyons & Lodato

Approximate inference in R: A case study with GLMMs and glmmsr
**ASA Grant Award Recipient**

The use of realistic statistical models for complex data is often hindered by the high cost of conducting inference about the model parameters. Because of this, it is sometimes necessary to use approximate inference methods, even though the impact of these approximations on the fitted model might not be well understood. I will discuss some practical examples of this, demonstrating how to fit various Generalized Linear Mixed Models with the R package glmmsr, using a variety of approximation methods, with a focus on what difference the choice of approximation makes to the resulting inference. I will talk about some more general issues along the way, such as how we might detect situations in which a given approximation might give unreliable inference, and the extent to which the choice of approximation method can and should be automated. I will finish by briefly reviewing some ideas about how best to share and discuss challenging models and datasets which could motivate the development of new approximation methods.

**Moderators**
## Patrícia Martinková

Researcher in statistics and psychometrics from Prague. Uses R to boost active learning in classes. Fulbright alumna and 2013-2015 visiting research scholar with Center for Statistics and the Social Sciences and Department of Statistics, University of Washington.

**Speakers**

Researcher, Institute of Computer Science, Czech Academy of Sciences

Wednesday June 29, 2016 1:54pm - 2:12pm PDT

Econ 140

Checkmate: Fast and Versatile Argument Checks

**Moderators**

**Speakers**

Dynamically typed programming languages like R allow programmers to interact with the language using an interactive Read-eval-print-loop (REPL) and to write generic, flexible and concise code. On the downside, as the R interpreter has no information about the expected data type, dynamically typed programming languages usually lack formal argument checks during runtime. Even worse, many R functions automatically convert the input to avoid throwing an exception. This results in exceptions which are hard to debug. In the worst case, the lack of argument checks leads to undetected errors and thus wrong results. To mitigate this issue, runtime assertions can be manually inserted into the code to ensure correct data types and content constraints, and useful debugging information is generated if the former are violated. The package checkmate offers an extensive set of functions to check the type and relevant characteristics of the most frequently used data types in R. For example, the function 'assertInteger' can also check for missing values, lower and upper bounds, min/exact/max length, duplicated values or names. The package is mostly written in C to avoid any unnecessary performance overhead. Thus, the programmer can write assertions which not only outperform custom R code for such purposes, but are also much shorter and more readable. Furthermore, checkmate can simplify the writing of unit tests by extending the testthat package with many new expectation functions. Third-party packages can link against checkmate's C code to conveniently check arbitrary SEXPs in compiled code.
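
To make the idea of a runtime argument assertion concrete, here is a minimal sketch in Python. It is not checkmate's API or implementation: the function `assert_integer` and its parameters are hypothetical, loosely mirroring the kinds of checks the abstract lists for 'assertInteger', and missing values are represented by `None` as an assumption of this sketch.

```python
def assert_integer(x, lower=None, upper=None, any_missing=True,
                   min_len=None, max_len=None, unique=False, name="x"):
    """Check that x is a list/tuple of integers; return x or raise."""
    if not isinstance(x, (list, tuple)):
        raise TypeError(
            f"'{name}' must be a list or tuple, not {type(x).__name__}")
    present = [v for v in x if v is not None]  # None stands in for NA
    for v in present:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(v, bool) or not isinstance(v, int):
            raise TypeError(
                f"'{name}' must contain only integers, found {v!r}")
    if not any_missing and len(present) != len(x):
        raise ValueError(f"'{name}' contains missing values")
    if min_len is not None and len(x) < min_len:
        raise ValueError(f"'{name}' must have length >= {min_len}")
    if max_len is not None and len(x) > max_len:
        raise ValueError(f"'{name}' must have length <= {max_len}")
    if lower is not None and any(v < lower for v in present):
        raise ValueError(f"'{name}' has elements below {lower}")
    if upper is not None and any(v > upper for v in present):
        raise ValueError(f"'{name}' has elements above {upper}")
    if unique and len(set(present)) != len(present):
        raise ValueError(f"'{name}' contains duplicated values")
    return x
```

A function can then open with a single call such as `assert_integer(ids, lower=1, any_missing=False, unique=True, name="ids")`, so that invalid input fails immediately with a message naming the argument instead of surfacing later as a hard-to-trace error.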

Wednesday June 29, 2016 1:54pm - 2:12pm PDT

SIEPR 130

How to do one's taxes with R

**Moderators**
## Rasmus Arnling Bååth

I'm a Data scientist at King interested in all things stats, but if it's Bayesian I'm especially interested.

**Speakers**

In this talk it is shown how to generate a tax return (German VAT) with R and send it over the internet to the tax administration. As this is certainly not a standard application for R (special software exists for this purpose) it may be worthwhile to have a closer look at the techniques used to realize this kind of transaction and to reveal any analogies to distributed data analysis. If confidential data cannot be analysed in the environment where it is created or stored, it has to be transferred over the internet to some kind of execution service, e.g. a cluster system. Encryption is necessary to protect the data, as is appending a digital signature to guarantee ownership and prevent modification. Additionally some kind of packaging has to be applied to the data together with metadata giving directions for the receiver to handle the delivery. When returning the result the same techniques are used, so privacy and authorship are again ensured. For the tax example all these procedures have to observe well established cryptographic standards for encryption, hashing and digital signatures which change from time to time according to new results in cryptographic research. I demonstrate an implementation in R for this kind of transaction in a data science context, trying to use the same rigorous standards mentioned above whenever possible. This leads to an overview of existing R packages and external software useful and necessary to realize a corresponding program. Finally some proposals for a possible standardization of a secure distributed data analysis scenario are presented.

Data Scientist, King

Wednesday June 29, 2016 1:54pm - 2:12pm PDT

McCaw Hall

Spatial data in R: simple features and future perspectives

**Moderators**
## Edzer Pebesma

My research interest is spatial, temporal, and spatiotemporal data in R. I am one of the authors, and maintainer, of sp and sf. You'll find my tutorial material at https://edzer.github.io/UseR2017/ - note that I will update it until shortly before the tutorial.

Simple feature access is an open standard for handling feature data (mostly points, lines and polygons) that has seen wide adoption in databases, javascript, and linked data. Currently, R does not have a complete solution for reading, handling and writing simple feature data. With funding from the R consortium, we will implement support for simple features in R. This talk discusses the challenges and potential benefits when doing so. It also points out challenges when analysing attribute data associated with simple features. In particular, the question whether a property refers to a property at every location of a feature (such as the land cover of a polygon delineating a forest) or merely to a summary statistic computed over the whole feature (such as population count over a county) is to be solved. Its consequences for further analysis and data integration will be illustrated, and solutions will be discussed.

professor, University of Muenster

Wednesday June 29, 2016 1:54pm - 2:12pm PDT

SIEPR 120

The simulator: An Engine for Streamlining Simulations

Methodological statisticians spend an appreciable amount of their time writing code for simulation studies. Every paper introducing a new method has a simulation section in which the new method is compared across several metrics to preexisting methods under various scenarios. Given the formulaic nature of the simulation studies in most statistics papers, there is a lot of code that can be reused. We have developed an R package, called the "simulator", that streamlines the process of performing simulations by creating a common infrastructure that can be easily used and reused across projects. The simulator allows the statistician to focus exclusively on those aspects of the simulation that are specific to the particular paper being written. Code for simulations written with the simulator is succinct, highly readable, and easily shared with others. The modular nature of simulations written with the simulator promotes code reusability, which saves time and facilitates reproducibility. Other benefits of using the simulator include the ability to "step in" to a simulation and change one aspect without having to rerun the entire simulation from scratch, the straightforward integration of parallel computing into simulations, and the ability to rapidly generate plots and tables with minimal effort.

Wednesday June 29, 2016 1:54pm - 2:12pm PDT

Barnes & McDowell & Cranston

Bringing the Power of R to Citizen Data Scientists

Organizations have an increasing amount of data that can be converted into information that offers the ability to make better decisions, find new opportunities, and improve efficiency. R is a critical advanced analytics tool that many organizations use to turn data into usable information. While the R language is comparatively easy to learn, it is still a traditional, written, computer programming language. Unfortunately, this fact alone limits the potential diffusion of R across most organizations, reducing its potential benefit to an organization.

Alteryx is a platform for data blending and advanced analytics. Its objective is to empower a greater number of individuals across an organization to successfully accomplish these tasks, improving the overall performance of an organization. Alteryx uses R as a key element in powering much of its advanced analytics capabilities, in a much easier to approach interface. Alteryx itself is best viewed as both a data pipelining engine and a visual programming framework.

In this talk, we introduce ourselves, highlight the benefits we provide our customers, particularly as it relates to R, and cover some of the things we do to help support the R community.

Wednesday June 29, 2016 1:55pm - 2:05pm PDT

Lane & Lyons & Lodato

Making the R community more open

The R community has historically done a poor job of welcoming newcomers and has, ironically, been rather “closed”: documentation written by experts for experts, email lists that can be a frightening place, and conferences aimed at the in-crowd. Luckily, multiple people and companies have been working hard in the last few years to make the R community more welcoming and open to newcomers. In this talk, we want to highlight some of these efforts.

DataCamp has trained ~400,000 people in R. As a sponsor, we want to highlight some of the efforts and projects DataCamp is undertaking as well:

- The new version of the RDocumentation package offers an augmented version of the documentation that can be used in the RStudio IDE, and allows users to give feedback to package authors.
- datacamp.com/teach enables anyone to create open(-source) courses on DataCamp using R Markdown.
- The tutorial package lets you create interactive R/Python challenges you can embed in blogs, vignettes, etc.
- DataCamp’s course curriculum, which now features courses built by some of the best people from the R community: Mine Çetinkaya-Rundel, Matt Dowle, Garrett Grolemund, Max Kuhn, Daniel Kaplan, Ted Kwartler, Zach Deane-Mayer, Jeffrey Ryan, Joshua Ulrich, Hadley & Charlotte Wickham, …
- DataCamp for Groups is being used by university professors and team leaders to train groups of people.

Wednesday June 29, 2016 2:05pm - 2:25pm PDT

Lane & Lyons & Lodato

brglm: Reduced-bias inference in generalized linear models

**Moderators**
## Patrícia Martinková

Researcher in statistics and psychometrics from Prague. Uses R to boost active learning in classes. Fulbright alumna and 2013-2015 visiting research scholar with Center for Statistics and the Social Sciences and Department of Statistics, University of Washington.

**Speakers**
## Ioannis Kosmidis

I am a Senior Lecturer at the Department of Statistical Science in University College London. My theoretical and methodological research focuses on optimal estimation and inference from complex statistical models, penalized likelihood methods and clustering. A particular focus of...

This presentation focuses on the brglm R package, which provides methods for reduced-bias inference in univariate generalised linear models and multinomial regression models with either ordinal or nominal responses (Kosmidis, 2014, JRSSB and Kosmidis and Firth, 2011, Biometrika, respectively). The core fitting method is based on the iterative correction of the bias of the maximum likelihood estimator, and results in the solution of appropriate bias-reducing adjusted score equations. For multinomial logistic regression, we present alternative algorithms that can scale up well with the number of multinomial responses and illustrate the finiteness and shrinkage properties that make bias reduction attractive for such models. For families with dispersion parameters (e.g. gamma regression), brglm uses automatic differentiation to compute the reduced-bias estimator of arbitrary invertible transformations of the dispersion parameter (e.g. user-supplied). We also present the implementation of appropriate methods for inference when bias-reduced estimation is being used.

Researcher, Institute of Computer Science, Czech Academy of Sciences

Associate Professor, Department of Statistical Science, University College London

Wednesday June 29, 2016 2:12pm - 2:30pm PDT

Econ 140

High performance climate downscaling in R

**Moderators**
## Edzer Pebesma

My research interest is spatial, temporal, and spatiotemporal data in R. I am one of the authors, and maintainer, of sp and sf. You'll find my tutorial material at https://edzer.github.io/UseR2017/ - note that I will update it until shortly before the tutorial.

**Speakers**
## James Hiebert

Global Climate Models (GCMs) can be used to assess the impacts of future climate change on particular regions of interest, municipalities or pieces of infrastructure. However, the coarse spatial scale of GCM grids (50km or more) can be problematic to engineers or others interested in more localized conditions. This is particularly true in areas with high topographic relief and/or substantial climate heterogeneity. A technique in climate statistics known as "downscaling" exists to map coarse scale climate quantities to finer scales.

Two main challenges are posed to potential climate downscalers: there exist few open source implementations of proper downscaling methods and many downscaling methods necessarily require information across both time and space. This requires high computational complexity to implement and substantial computational resources to execute.

The Pacific Climate Impacts Consortium in Victoria, BC, Canada has written a high-performance implementation of the detrended quantile mapping (QDM) method for downscaling climate variables. QDM is a proven climate downscaling technique that has been shown to preserve the relative changes in both the climate means and the extremes. We will release this software named "ClimDown" to CRAN under an open source license. Using ClimDown, researchers can downscale spatially coarse global climate models to an arbitrarily fine resolution for which gridded observations exist.

Our proof-of-concept for ClimDown has been to downscale all of Canada at 10km resolution for models and scenarios from the IPCC's Coupled Model Intercomparison Project. We will present our results and performance metrics from this exercise.

professor, University of Muenster

University of Victoria

My background is in Computer Science and I've spent my career applying CS to the earth and ocean sciences. I have worked for NOAA on mapping the ocean floors and have worked at the Pacific Climate Impacts Consortium (University of Victoria) making climate change information relevant...

Wednesday June 29, 2016 2:12pm - 2:30pm PDT

SIEPR 120

Providing Digital Provenance: from Modeling through Production

**Moderators**

**Speakers**
## Eduardo Ariño de la Rubia

Reproducibility is important throughout the entire data science process. As recent studies have shown, subconscious biases in the exploratory analysis phase of a project can have vast repercussions over final conclusions. The problems with managing the deployment and life-cycle of models in production are vast and varied, and often reproducibility stops at the level of the individual analyst. Though R has best-in-class support for reproducible research, with tools ranging from knitr to packrat, these tools are limited in their scope. In this talk we present a solution we have developed at Domino, which allows for every model in production to have full reproducibility from EDA to the training run and the exact datasets which were used to generate it. We discuss how we leverage Docker as our reproducibility engine, and how this allows us to provide the irrefutable provenance of a model.

Chief Data Scientist in Residence, Domino Data Lab

Eduardo Arino de la Rubia is Chief Data Scientist at Domino Data Lab. Eduardo is a lifelong technologist with a passion for data science who thrives on effectively communicating data-driven insights throughout an organization. He is a graduate of the MTSU Computer Science department...

Wednesday June 29, 2016 2:12pm - 2:30pm PDT

Barnes & McDowell & Cranston

Using R in a regulatory environment: FDA experiences.

**Moderators**
## Rasmus Arnling Bååth

I'm a Data scientist at King interested in all things stats, but if it's Bayesian I'm especially interested.

**Speakers**

The Food and Drug Administration (FDA) regulates products which account for approximately one fourth of consumer spending in the United States of America, and has global impact, particularly for medical products. This talk will discuss the Statistical Software Clarifying Statement (http://www.fda.gov/ForIndustry/DataStandards/StudyDataStandards/ucm445917.htm), which corrects the misconception that FDA requires the use of proprietary software for FDA submissions. Next, we will describe several use cases for R at FDA, including review work, research, and collaborations with industry, academe and other government agencies. We describe advantages, challenges and opportunities of using R in a regulatory setting. Finally, we close with a brief demonstration of a Shiny openFDA application for the FDA Adverse Event Reporting System (FAERS) available at https://openfda.shinyapps.io/LRTest/.

Data Scientist, King

Wednesday June 29, 2016 2:12pm - 2:30pm PDT

McCaw Hall

ALZCan: Predicting Future Onset of Alzheimer's Using Gender, Genetics, Cognitive Tests, CSF Biomarkers, and Resting State fMRI Brain Imaging
**Poster #13**

Due to a lack of preventive methods and precise diagnostic tests, only 45% of Alzheimer’s patients are told about their diagnosis. I hypothesized that one can create an accurate diagnostic/prognostic software tool for early detection of Alzheimer's using functional connectivity in resting-state fMRI brain imaging, genetic SNP data, cerebrospinal fluid (CSF) concentrations, demographic information, and psychometric tests.

Using R programming language and data from ADNI, an ongoing, longitudinal, global effort tracking clinical/imaging AD biomarkers, I examined 678 4D fMRI scans and 56847 observations of 1722 individuals across three diagnostic groups. ICA on fMRI scans yielded graph structures of connectivity between brain networks. For diagnosis, 4 support vector machines and 6 gradient boosting machines were trained 10 times each for fMRI, genetic, CSF biomarker, and cognitive data. For prognosis, 3 linear regression models predicted cognitive scores 6 to 60 months into the future. Forecasted cognitive scores and demographic information were used for prognosis.

ALZCan had 81.82% diagnostic accuracy. Prognostic accuracy for 6, 12, 18 months in future was 75.4%, 68.3%, 68.6%. AD patients showed significantly lower transitivity and average path length between functional brain networks. I examined relative influence/predictive power of multiple biomarkers, confirming previous findings that gender has higher influence than genetic factors on AD diagnosis. Overall, this study engineered a novel neuroimaging feature selection method by using machine learning and graph-theoretic functional network connectivity properties for diagnosis/prognosis of disease states. This analytical tool is capable of predicting future onset of Alzheimer’s and Mild Cognitive Impairment with significant accuracy.

Wednesday June 29, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Analyzing and visualizing spatially and temporally variable floodplain inundation
**Poster #24**

With the continuing degradation of riverine ecosystems, advancing our understanding of the spatially and temporally variable floodplain conditions produced by a river’s flood regime is essential to better manage these systems for greater ecological integrity. This requires development of analysis and visualization techniques for multi-dimensional spatio-temporal data. Research presented here applies 2D hydrodynamic modeling output of a floodplain restoration site along the lower Cosumnes River, California in R to analyze this spatio-temporal raster data and develop informative and engaging visualizations. Modeling output is quantified and compared within and across modeled flood events in space and time using metrics such as depth, velocity, and duration. To aid comparison and interpretation, rasters of model time steps are also summarized by integrating across space as well as across time. Data manipulation and summary is performed primarily within the raster package. This research presents new methods for quantifying and visualizing hydrodynamic modeling outcomes, improving understanding of the complex and variable floodplain inundation patterns that drive ecosystem function and process.

**Speakers**
## Alison A. Whipple

PhD Candidate, University of California, Davis

analyzing and visualizing spatio-temporal data

Wednesday June 29, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Approaches to R education in Canadian universities
**Poster #4**

The R language is a powerful tool used in a wide array of research disciplines and owes a large amount of its success to its open source and adaptable nature. This has caused rapid growth of formal and informal online and text resources that is beginning to present challenges to novices learning R. Students are often first exposed to R in upper division undergraduate classes or during their graduate studies. The way R is presented has consequences for the fundamental understanding of the program and language itself. That is to say, there is a dramatic difference in user comprehension of R if learning it as a tool to do an analysis as opposed to learning another subject (e.g. statistics) using R. While some universities do offer courses specific to R, it is more commonly incorporated into a pre-existing course or a student is left to learn the program on his or her own. To better establish how students are exposed to R, an understanding of the approaches to R education is critical. In this survey we evaluated the current use of R in Canadian university courses to determine what methods are most common for presenting R. While data are still being collected, we anticipate that courses using R to teach another concept will be much more common than courses dedicated to R itself. This information will influence how experienced educators as well as programmers approach R, specifically when developing educational and supplemental content in online, text, and package-specific formats.

Wednesday June 29, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Bayesian inference for Internet ratings data using R
**Poster #30**

Internet ratings data are usually ordinal measurements from 1 to 5 (or 10) rated by Internet users on the quality of all kinds of items. The traditional graphical displays of the ratings data do not account for inter-rater differences. Some model-based methods with an MCMC approach have been suggested to address this problem. In the present work we propose a real-time Bayesian inference algorithm for parameter estimation. Two real data sets and the R implementation of the above-mentioned algorithm will be presented.

**Speakers**
## Ruby Chiu-Hsing Weng

Professor, Department of Statistics, National Chengchi Univ.

Ph.D., Statistics, University of Michigan
My research interests include sequential analysis, time series analysis, Bayesian inference and machine learning.

Wednesday June 29, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Data Quality Profiling - The First Step with New Data
**Poster #23**

The first step when getting a new data set is to look at the data for completeness, accuracy, and reasonableness. This talk will describe a method based on Jack Olson's Data Quality - The Accuracy Dimension. The input data set can be a raw text or spreadsheet file, or can come from a source with columnar metadata such as a SQL table or an R data frame. The only setup is to connect to the data source. Using RMarkdown, dplyr, grid, and ggplot2 we produce a report in which each column is profiled by data type, summary statistics (if numeric or date), distribution plot, counts, and head and tail values. This facilitates a quick visual scan of each column for data quality issues. The simple visual format also aids communication with the data provider to dig into quality issues and, hopefully, clean up the data set before wasting time and effort on an analysis flawed by bad data. We provide examples of both good and suspect columns.
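
To make the idea of a per-column profile concrete, here is a minimal sketch in Python using only the standard library. It is not the presenter's RMarkdown/dplyr report and produces no plots; the function names, the toy table, and the choice of which statistics to report are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, median


def infer_type(values):
    """Label a column 'numeric' if every non-missing value is an int or float."""
    present = [v for v in values if v is not None]
    is_number = lambda v: isinstance(v, (int, float)) and not isinstance(v, bool)
    return "numeric" if present and all(is_number(v) for v in present) else "text"


def profile_column(name, values, n_edge=2):
    """Build a small profile: type, counts, first/last values, and stats if numeric."""
    present = [v for v in values if v is not None]
    profile = {
        "column": name,
        "type": infer_type(values),
        "n": len(values),
        "n_missing": len(values) - len(present),
        "n_distinct": len(set(present)),
        "top_counts": Counter(present).most_common(3),
        "head": values[:n_edge],
        "tail": values[-n_edge:],
    }
    if profile["type"] == "numeric":
        profile["summary"] = {
            "min": min(present),
            "median": median(present),
            "mean": mean(present),
            "max": max(present),
        }
    return profile


# Hypothetical table with two suspect columns: an implausible age of 410
# and inconsistent capitalisation of a state code.
table = {
    "age": [34, 41, None, 29, 410],
    "state": ["CA", "CA", "NY", None, "ca"],
}
profiles = {name: profile_column(name, values) for name, values in table.items()}
for profile in profiles.values():
    print(profile)
```

Scanning the output, the maximum of `age` and the three distinct values of `state` stand out immediately, which is the kind of quick visual check the abstract describes.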

**Speakers**
## Jim Porzak

Principal, DS4CI

I am a (semi-)retired data scientist specializing in customer insights. I have been using R since 2002 and have presented at all but two useR! conferences starting with the first Vienna useR! 2004. See my archives, ds4ci.org/archives/, for past presentations including tutorials at...

Wednesday June 29, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Developing R Tools for Energy Data Analysis
**Poster #15**

Energy efficiency program evaluators use data from a variety of sources, which range from utility billing databases to surveys to logs from smart thermostats. We estimate reductions in energy usage attributable to various energy savings programs. Our evaluations are used to certify that utilities are meeting state or federally mandated efficiency standards; to reimburse utilities for funds spent on what amounts to reducing demand for power; or to assess whether programs are worth continuing or expanding. It is therefore critical that our estimates be reliable and reproducible.

The raw data we receive is usually very messy, and the types of problems we find in the data are sufficiently niche that off-the-shelf data cleaning software is of limited help.

Our poster demonstrates a suite of custom R packages and Shiny apps that we've developed to streamline the process of prepping energy use data for analysis. Our current tools include:
• A package, noaaisd, which pulls appropriate hourly weather data from the NOAA website and appends it to geocoded customer usage data
• A Shiny app that allows users to quickly explore individual smart thermostat logs and save information about the quality of each log’s data

Features under development include:
• A package to automate the cleaning of utility billing data (this requires, among other things, the ability to detect and appropriately correct gaps or overlaps in billing periods for individual customers, and the ability to flag abnormal billing periods or energy consumption)
• A Shiny app to help build baseline energy usage models
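
The gap and overlap detection mentioned for billing periods can be sketched as follows. This is an illustrative Python example, not the authors' R package (which the abstract describes as still under development); the function name, the inclusive-end-date convention, and the sample periods are assumptions.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)


def find_gaps_and_overlaps(periods):
    """Report gaps and overlaps for one customer's billing periods.

    `periods` is a list of (start, end) dates with inclusive end dates.
    Returns a list of (kind, first_day, last_day) tuples.
    """
    ordered = sorted(periods)
    if not ordered:
        return []
    issues = []
    covered_through = ordered[0][1]  # latest day billed so far
    for start, end in ordered[1:]:
        expected_start = covered_through + ONE_DAY
        if start > expected_start:
            # Days that no billing period covers.
            issues.append(("gap", expected_start, start - ONE_DAY))
        elif start < expected_start:
            # Days billed by this period and by an earlier one.
            issues.append(("overlap", start, min(end, covered_through)))
        covered_through = max(covered_through, end)
    return issues


# Hypothetical billing history with one gap and one overlap.
billing_periods = [
    (date(2016, 1, 1), date(2016, 1, 31)),
    (date(2016, 2, 1), date(2016, 2, 29)),
    (date(2016, 3, 5), date(2016, 3, 31)),
    (date(2016, 3, 28), date(2016, 4, 30)),
]
issues = find_gaps_and_overlaps(billing_periods)
for kind, first_day, last_day in issues:
    print(kind, first_day, last_day)
```

The sketch only detects problems; how to correct them (for example, prorating usage across an overlap) is a separate decision that depends on the utility's billing rules.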

Wednesday June 29, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion

Educational Disparities, Biomedical Efficacy and Science Knowledge Gaps: can the Internet help us reduce these inequalities?
**Poster #2**

The economic, health, and knowledge disparities between the world’s “haves” and “have-nots” are some of the key issues we face in this day and age (World Economic Forum, 2011). Unfortunately, very little communication research has been applied to understanding what we can do to help reduce these inequalities. Even more worryingly, some studies have found that feeding more information to the public through traditional media has the adverse effect of widening gaps based on educational disparities (Tichenor, Donohue, & Olien, 1970). We study the impact that Internet use has on the disparity between less and more highly educated citizens in terms of their science (biomedical) knowledge, as well as their sense of efficacy regarding medical research. For this, we employ Wave II of the Wellcome Trust Monitor Survey (2012), which was fielded to a nationally representative sample of the UK population. We fit a series of moderated regression models with mean centring using the ‘lmres’ function in the ‘pequod’ package (Mirisola, A. & Seta, L., 2016). We also use the ‘simpleSlope’ and ‘PlotSlope’ functions to carry out a simple slope analysis and to create two- and three-way interaction plots. These functions cover what the statistical literature recommends for such tests, and they save time and effort by reducing the number of analytical steps. Using R, we found that increased Internet use in the lower education group can help significantly narrow both the knowledge and efficacy gaps that emerge from educational disparities. Implications for science communication are discussed.

Wednesday June 29, 2016 2:30pm - 3:30pm PDT

Sponsor Pavilion
