Contributed talk
Tuesday, June 28
 

10:30am PDT

R in machine learning competitions
Kaggle is a community of almost 450,000 data scientists who have built almost 2 million machine learning models to participate in our competitions. Data scientists come to Kaggle to learn, collaborate and advance the state of the art in machine learning. This talk will cover some of the lessons from winning techniques, with a particular emphasis on best-practice R use.

Moderators
Heather Turner
Freelance consultant
I'm a freelance statistical computing consultant providing support in R to people in a range of industries, but particularly the life sciences. My interests include statistical modelling, clustering, bioinformatics, reproducible research and graphics. I chair the core group of Forwards...

Speakers
Anthony Goldbloom
CEO, Kaggle
Anthony is the founder and CEO of Kaggle. Before founding Kaggle, Anthony worked in the macroeconomic modeling areas of the Reserve Bank of Australia and, before that, the Australian Treasury. He holds a first-class honours degree in economics and econometrics from the University of...


Tuesday June 28, 2016 10:30am - 10:48am PDT
McCaw Hall

10:48am PDT

Connecting R to the OpenML project for Open Machine Learning
OpenML is an online machine learning platform where researchers can automatically log and share data, code, and experiments, and organize them online to work and collaborate more effectively. We present an R package to interface with the OpenML platform and illustrate its usage both as a stand-alone package and in combination with the mlr machine learning package. We show how the OpenML package allows R users to easily search, download and upload machine learning datasets. Users can automatically log their experiment results online, have them evaluated on the server, share them with others and download results from other researchers to build on them. Beyond ensuring reproducibility of results, this automates much of the drudge work, speeds up research, facilitates collaboration and increases users' visibility online. Currently, OpenML has 1,000+ registered users, 2,000+ unique monthly visitors, 2,000+ datasets, and 500,000+ experiments. The OpenML server currently supports client interfaces for Java, Python, .NET and R, as well as specific interfaces for the WEKA, MOA, RapidMiner, scikit-learn and mlr machine learning toolboxes.
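A minimal sketch of the search-download-run-upload loop described above; function names follow the OpenML and mlr packages, and the task and learner choices are illustrative:

# Assumes the OpenML and mlr packages and a configured OpenML API key.
library(OpenML)
library(mlr)
ds   <- listOMLDataSets()               # search available datasets
task <- getOMLTask(task.id = 59)        # download a task (here: iris)
lrn  <- makeLearner("classif.rpart")    # any mlr learner
run  <- runTaskMlr(task, lrn)           # run locally; results are logged
# uploadOMLRun(run)                     # share the run on the server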

Moderators
Heather Turner
Freelance consultant

Speakers
Joaquin Vanschoren
Assistant Professor, Eindhoven University of Technology
My research focuses on the automation and democratization of machine learning. I founded OpenML.org, a collaborative machine learning platform where scientists can automatically log and share data, code, and experiments, and which automatically learns from all this data to help people...


Tuesday June 28, 2016 10:48am - 11:06am PDT
McCaw Hall

11:06am PDT

trackeR: Infrastructure for running and cycling data from GPS-enabled tracking devices in R
The use of GPS-enabled tracking devices and heart rate monitors is becoming increasingly common in sports and fitness activities. The trackeR package aims to fill the gap between the routine collection of data from such devices and their analyses in a modern statistical environment like R. The package provides methods to read tracking data and store them in session-based, unit-aware, and operation-aware objects of class trackeRdata. The package also implements core infrastructure for relevant summaries and visualisations, as well as support for handling units of measurement. There are also methods for relevant analytic tools such as time spent in zones, work capacity above critical power (known as W'), and distribution and concentration profiles. A case study illustrates how the latter can be used to summarise the information from training sessions and use it in more advanced statistical analyses.
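A short sketch of this workflow using the example data shipped with trackeR; argument names follow the package but are abbreviated here:

library(trackeR)
data("runs", package = "trackeR")      # example trackeRdata object
summary(runs, session = 1)             # unit-aware session summary
plot(runs, session = 1:2)              # heart rate, pace, etc. over time
zn <- zones(runs, session = 1, what = "speed")  # time spent in speed zones
plot(zn)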

Moderators
Heather Turner
Freelance consultant

Speakers
Hannah Frick
Research associate, University College London

Tuesday June 28, 2016 11:06am - 11:24am PDT
McCaw Hall

11:24am PDT

United Nations World Population Projections with R
Recently, the United Nations adopted a probabilistic approach to projecting fertility, mortality and population for all countries. In this approach, total fertility and female and male life expectancy at birth are projected using Bayesian hierarchical models estimated via Markov chain Monte Carlo, and then combined to yield probabilistic projections for any population quantity of interest. The methodology is implemented in a suite of R packages which has been used by the UN to produce the most recent revision of the World Population Prospects. I will summarize the main ideas behind each of the packages, namely bayesTFR, bayesLife, bayesPop, bayesDem, and the shiny-based wppExplorer. I will also touch on our experience of the collaboration between academics and the UN.
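A hedged sketch of the package pipeline; the output directory and (deliberately tiny) iteration counts are placeholders, and real runs use far longer chains:

library(bayesTFR)
sim.dir <- tempfile()
m    <- run.tfr.mcmc(output.dir = sim.dir, iter = 10)   # toy chain length
pred <- tfr.predict(sim.dir = sim.dir, burnin = 5)      # probabilistic TFR
# bayesLife and bayesPop follow the same pattern for life expectancy and
# population; wppExplorer::wpp.explore() opens the shiny-based browser.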

Moderators
Heather Turner
Freelance consultant

Speakers
Hana Ševčíková
University of Washington


Tuesday June 28, 2016 11:24am - 11:42am PDT
McCaw Hall

11:42am PDT

jailbreakr: Get out of Excel, free
One out of every ten people on the planet uses a spreadsheet and about half of those use formulas: "Let's not kid ourselves: the most widely used piece of software for statistics is Excel." (Ripley, 2002) Those of us who script analyses are in the distinct minority! There are several effective packages for importing spreadsheet data into R. But, broadly speaking, they prioritize access to [a] data and [b] data that lives in a neat rectangle. In our collaborative analytical work, we battle spreadsheets created by people who did not get this memo. We see messy sheets, with multiple data regions sprinkled around, mixed with computed results and figures. Data regions can be a blend of actual data and, e.g., derived columns that are computed from other columns. We will present our work on extracting tricky data and formula logic out of spreadsheets. To what extent can data tables be automatically identified and extracted? Can we identify columns that are derived from others in a wholesale fashion and translate that into something useful on the R side? The goal is to create a more porous border between R and spreadsheets. Target audiences include novices transitioning from spreadsheets to R and experienced useRs who are dealing with challenging sheets.
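To make the problem concrete, here is the manual workaround the project aims to automate: carving one data rectangle out of a messy sheet by hand, versus a hypothetical jailbreakr-style call (the file name, cell range, and split_sheet() are illustrative only):

library(readxl)
# Manual: you must already know where the data rectangle lives.
raw <- read_excel("messy.xlsx", range = "B4:E20")
# Hypothetical automated API, for illustration only:
# tables <- jailbreakr::split_sheet("messy.xlsx")  # auto-detect data regions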

Moderators
Heather Turner
Freelance consultant

Speakers
Jenny Bryan
University of British Columbia, rOpenSci


Tuesday June 28, 2016 11:42am - 12:00pm PDT
McCaw Hall

1:00pm PDT

Linking htmlwidgets with crosstalk and mobservable
The htmlwidgets package makes it easy to create interactive JavaScript widgets from R, and to display them from the R console or insert them into R Markdown documents and Shiny apps. These widgets exhibit interactivity "in the small": they can respond to mouse clicks and other user gestures within their widget boundaries. This talk will focus on interactivity "in the large", where interacting with one widget results in coordinated changes in other widgets (for example, select some points in one widget and the corresponding observations are instantly highlighted across the other widgets). This kind of inter-widget interactivity can be achieved by writing a Shiny app to coordinate multiple widgets (and indeed, this is a common way to use htmlwidgets). But some situations call for a more lightweight solution. crosstalk and mobservable are two distinct but complementary approaches to the problem of widget coordination, authored by myself and Ramnath Vaidyanathan, respectively. Each augments htmlwidgets with pure-JavaScript coordination logic; neither requires Shiny (or indeed any runtime server support at all). The resulting documents can be hosted on GitHub, RPubs, Amazon S3, or any static web host. In this talk, I'll demonstrate these new tools, and discuss their advantages and limitations compared to existing approaches.
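A minimal sketch of "in the large" linking via crosstalk, assuming the plotly and DT widgets; no server is required:

library(crosstalk)
library(plotly)
library(DT)
shared <- SharedData$new(mtcars)         # one shared data source
bscols(
  plot_ly(shared, x = ~wt, y = ~mpg),    # select points here...
  datatable(shared)                      # ...rows highlight here
)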

Moderators
Torben Tvedebrink
Associate Professor, Department of Mathematical Sciences, Aalborg University
I'm a statistician working in the area of forensic genetics. R is a core part of my teaching and research. I was the chair of the local organising committee for useR! 2015 in Aalborg, and part of the programme committee for useR! 2016 in Stanford.

Speakers
Joe Cheng
CTO, RStudio, PBC
Joe Cheng is the Chief Technology Officer at RStudio PBC, where he helped create the RStudio IDE and Shiny web framework.


Tuesday June 28, 2016 1:00pm - 1:18pm PDT
McCaw Hall

1:18pm PDT

Transforming a museum to be data-driven using R
With the exponential growth of data, more and more businesses want to become data-driven. As organisations seek value from their data, big data and data science initiatives, jobs, and skill sets have risen up the business agenda. R, being a data scientist's best friend, plays an important role in this transformation. But how do you transform a traditionally non-data-orientated business into a data-driven one, armed with R, data science processes and plenty of enthusiasm?

The first data scientist at a museum shares her experience of the journey to transform the 250-year-old British Museum into a data-driven organisation by 2018. How is one of the most popular museums in the world, with 6.8 million annual visitors, using R to achieve this transition?
• Data wrangling
• Exploring data to make informed decisions
• Winning stakeholders' support with data visualisations and dashboards
• Predictive modelling
• Future uses, including the internet of things, machine learning, etc.

Using R and data science, any organisation can become data-driven. With demand for data and analytical skills outstripping supply, more businesses need to know that R is part of the solution, and that R is a great language to learn for individuals wanting to get into data science.

Moderators
Torben Tvedebrink
Associate Professor, Department of Mathematical Sciences, Aalborg University

Speakers
Alice Daish
Data Scientist, British Museum, R-Ladies Global Leadership
@alice_data


Tuesday June 28, 2016 1:18pm - 1:36pm PDT
McCaw Hall

1:36pm PDT

CVXR: An R Package for Modeling Convex Optimization Problems
CVXR is an R package that provides an object-oriented modeling language for convex optimization. It allows the user to formulate convex optimization problems in a natural mathematical syntax rather than the restrictive standard form required by most solvers. The user specifies an objective and a set of constraints by combining constants, variables, and parameters using a library of functions with known curvature and monotonicity properties. CVXR then applies signed disciplined convex programming (DCP) to verify the problem's convexity and, once verified, converts the problem into standard conic form using graph implementations and passes it to an open-source cone solver such as ECOS or SCS. We demonstrate CVXR's modeling framework with several applications.
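A small worked instance of this modeling style, non-negative least squares, written in CVXR's natural syntax (the data are simulated for illustration):

library(CVXR)
set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)
y <- X %*% c(1, 0, 2, 0, 1) + rnorm(100)
beta <- Variable(5)                       # optimization variable
prob <- Problem(Minimize(sum_squares(y - X %*% beta)),
                list(beta >= 0))          # constraint; convexity checked by DCP
fit  <- solve(prob)                       # dispatched to a cone solver
fit$getValue(beta)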

Moderators
Torben Tvedebrink
Associate Professor, Department of Mathematical Sciences, Aalborg University

Speakers
Anqi Fu
Life Science Research Professional, Stanford University
I am a Life Science Research Professional working with Dr. Stephen Boyd and Dr. Lei Xing on applications of convex optimization to radiation treatment planning. Prior to this, I was a Machine Learning Scientist at H2O.ai, developing and testing large-scale, distributed algorithms...


Tuesday June 28, 2016 1:36pm - 1:54pm PDT
McCaw Hall

1:54pm PDT

Statistics and R in Forensic Genetics
Genetic evidence is often used in disputes, most commonly DNA profiles in familial or criminal cases. In this talk, we go through the statistical framework for evaluating genetic evidence by calculating an evidential weight. The focus will be the statistical aspects of how DNA material from the male Y chromosome can help resolve sexual assault cases: in particular, how an evidential weight for Y-chromosomal DNA can be calculated using various statistical methods, and how those methods use statistics and R. One of the methods is the discrete Laplace method, a statistical model consisting of a mixture of discrete Laplace distributions (an exponential family). We demonstrate how inference for that method was initially done using R's built-in glm function with a new family function for the discrete Laplace distribution. We also explain how inference was sped up by recognising the model as a weighted two-way layout with an implicit model matrix, and how this was implemented as a special case of iteratively reweighted least squares.
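A runnable sketch, assuming the interface of the speaker's disclapmix package, which implements the fast fitting described above; the haplotype data here are simulated toy values, not real Y-STR profiles:

library(disclapmix)
set.seed(1)
haps <- matrix(sample(10:15, 200 * 4, replace = TRUE), ncol = 4)  # toy Y-STR repeat counts
fit  <- disclapmix(haps, clusters = 3)   # mixture of discrete Laplace distributions
# fit can then be used to estimate haplotype (match) probabilities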

Moderators
Torben Tvedebrink
Associate Professor, Department of Mathematical Sciences, Aalborg University

Speakers
Mikkel Meyer Andersen
Assistant Professor, Department of Mathematical Sciences, Aalborg University
I'm an applied statistician working with statistics for forensic genetics. R (as well as other programming languages and technologies) is a core part of my teaching and research. I was part of the local organising committee for useR! 2015 in Aalborg. useR! 2016 is my 3rd useR! conference...


Tuesday June 28, 2016 1:54pm - 2:12pm PDT
McCaw Hall

2:12pm PDT

FiveThirtyEight's data journalism workflow with R
FiveThirtyEight is a data journalism site that uses R extensively for charts, stories, and interactives. We’ve used R for stories covering: p-hacking in nutrition science; how Uber is affecting New York City taxis; workers in minimum-wage jobs; the frequency of terrorism in Europe; the pitfalls in political polling; and many, many more.

R is used in every step of the data journalism process: cleaning and processing data, exploratory graphing and statistical analysis, deploying models in real time, and creating publishable data visualizations. We write R code to underpin several of our popular interactives as well, like the Facebook Primary and our historical Elo ratings of NBA and NFL teams. Heck, we've even styled a custom ggplot2 theme. We also use R on long-term investigative projects.
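This is not FiveThirtyEight's actual theme, just a sketch of how a house style gets encoded once as a ggplot2 theme and reused across stories:

library(ggplot2)
theme_newsroom <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(panel.grid.minor = element_blank(),
          plot.title = element_text(face = "bold"))
}
ggplot(mtcars, aes(wt, mpg)) + geom_point() + theme_newsroom()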

In this presentation, I’ll walk through how cutting-edge, data-oriented newsrooms like FiveThirtyEight use R by profiling a series of already-published stories and projects. I’ll explain our use of R for chart-making in sports and politics stories; for the data analysis behind economics and science feature pieces; and for production-worthy interactives.

Moderators
Torben Tvedebrink
Associate Professor, Department of Mathematical Sciences, Aalborg University

Speakers
Andrew Flowers
Quantitative Editor, FiveThirtyEight
As the quantitative editor of FiveThirtyEight, I write stories about a variety of topics -- economics, politics, sports -- while also doing data science tasks for other staff writers. Before starting at FiveThirtyEight in 2013, I was at the Federal Reserve Bank of Atlanta.


Tuesday June 28, 2016 2:12pm - 2:30pm PDT
McCaw Hall

4:45pm PDT

R at Microsoft
Since the acquisition of Revolution Analytics in April 2015, Microsoft has embarked upon a project to build R technology into many Microsoft products, so that developers and data scientists can use the R language and R packages to analyze data in their data centers and in cloud environments.

In this talk I will give an overview (and a demo or two) of how R has been integrated into various Microsoft products. Microsoft data scientists are also big users of R, and I’ll describe a couple of examples of R being used to analyze operational data at Microsoft. I’ll also share some of my experiences in working with open source projects at Microsoft, and my thoughts on how Microsoft works with open source communities including the R Project.

Moderators
Jacqueline Meulman
Visiting Professor, Stanford University

Speakers
David Smith
Cloud Advocate, Microsoft
Ask me about R at Microsoft, the R Consortium, or the Revolutions blog.


Tuesday June 28, 2016 4:45pm - 5:03pm PDT
McCaw Hall

5:03pm PDT

AF: a new package for estimating the attributable fraction
The attributable fraction (or attributable risk) is a widely used measure that quantifies the public health impact of an exposure on an outcome. Even though the theory for AF estimation is well developed, there has been a lack of up-to-date software implementations. The aim of this article is to present a new R package for AF estimation with binary exposures. The package AF allows for confounder-adjusted estimation of the AF for the three major study designs: cross-sectional, (possibly matched) case-control and cohort. The article is divided into theoretical sections and applied sections. In the theoretical sections we describe how the confounder-adjusted AF is estimated for each specific study design. These sections serve as a brief but self-contained tutorial in AF estimation. In the applied sections we use real data examples to illustrate how the AF package is used. All datasets in these examples are publicly available and included in the AF package, so readers can easily replicate all analyses.
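A hedged sketch of confounder-adjusted AF estimation for a cohort or cross-sectional design; function and variable names follow the AF package and its example data, but may differ slightly across versions:

library(AF)
data(clslowbwt, package = "AF")                       # example data shipped with AF
fit <- glm(lbw ~ smoker + race, family = binomial, data = clslowbwt)
AFglm(object = fit, data = clslowbwt, exposure = "smoker")  # adjusted AF estimate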

Moderators
Jacqueline Meulman
Visiting Professor, Stanford University

Speakers
Elisabeth Dahlqwist
Karolinska Institutet
I am a PhD student at Karolinska Institutet, Department of Medical Epidemiology and Biostatistics. The topic of my PhD is methodological development of the attributable fraction. My first project has been to implement an R package which unifies methods for estimating the model-based...


Tuesday June 28, 2016 5:03pm - 5:21pm PDT
McCaw Hall

5:21pm PDT

broom: Converting statistical models to tidy data frames
The concept of "tidy data" offers a powerful and intuitive framework for structuring data to ease manipulation, modeling and visualization, and has guided the development of R tools such as ggplot2, dplyr, and tidyr. However, most functions for statistical modeling, both built-in and in third-party packages, produce output that is not tidy and is therefore difficult to reshape, recombine, and otherwise manipulate. I introduce the package "broom", which turns the output of model objects into tidy data frames suited to further analysis and visualization with input-tidy tools. The package defines the tidy, augment, and glance methods, which arrange a model into three levels of tidy output: the component level, the observation level, and the model level, respectively. These three levels can describe many kinds of statistical models, and offer a framework for combining and reshaping analyses using standardized methods. Along with the implementations in the broom package, this offers a grammar for describing the output of statistical models that can be applied across many statistical programming environments, including databases and distributed applications.
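The three levels in action for a linear model; tidy(), augment() and glance() are the broom generics described above:

library(broom)
fit <- lm(mpg ~ wt + qsec, data = mtcars)
tidy(fit)      # component level: one row per coefficient
augment(fit)   # observation level: fitted values and residuals per row
glance(fit)    # model level: R-squared, AIC, etc. in a one-row data frame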

Moderators
Jacqueline Meulman
Visiting Professor, Stanford University

Speakers
David Garrett Robinson
Heap Analytics
David Robinson is Director of Data Science at Heap Analytics, where he's helping to build the next generation of product analytics technology. He's the co-author with Julia Silge of the tidytext package and the O'Reilly book Text Mining with R. He also created the broom, fuzzyjoin...


Tuesday June 28, 2016 5:21pm - 5:39pm PDT
McCaw Hall

5:39pm PDT

Rho: High Performance R
The Rho project (formerly known as CXXR) is working on transforming the current R interpreter into a high performance virtual machine for R. Using modern software engineering techniques and the research done on VMs for dynamic and array languages over the last twenty years, we are targeting a factor of ten speed improvement or better for most types of R code, while retaining full compatibility.

This talk will discuss the current compatibility and performance of the VM and the types of tasks it currently does well, and will outline the project's roadmap for the next year.

Moderators
Jacqueline Meulman
Visiting Professor, Stanford University

Speakers

Tuesday June 28, 2016 5:39pm - 5:57pm PDT
McCaw Hall

5:57pm PDT

Colour schemes in data visualisation: Bias and Precision
The technique of mapping continuous values to a sequence of colours is often used to visualise quantitative data, yet the ability of different colour schemes to facilitate data interpretation has not been thoroughly tested. Using a survey framework built with Shiny and loggr, we compared six commonly used colour schemes in two experiments: a measure of perceptual linearity, and a map-reading task assessing (1) bias and precision in data interpretation, (2) response time and (3) colour preferences. The single-hue schemes were unbiased (perceived values did not consistently deviate from the true value) but very imprecise (large variance among perceived values). Schemes with hue transitions improved precision, but were highly biased when not close to perceptual linearity (especially the multi-hue 'rainbow' schemes). Response time was shorter for the single-hue schemes and longer for more complex colour schemes. There was no aesthetic preference for any of the colourful schemes. These results show that in choosing a colour scheme to communicate quantitative information there are two potential pitfalls: bias and imprecision. Every use of colour to represent data should weigh this bias-precision trade-off and select the scheme that balances the two potential communication errors.
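A quick way to see the trade-off for yourself, using only base R palettes: the same surface under a single-hue scheme and a 'rainbow' scheme:

single_hue <- gray.colors(64)          # unbiased but imprecise
multi_hue  <- rainbow(64)              # more precise, biased where non-linear
par(mfrow = c(1, 2))
image(volcano, col = single_hue, main = "single hue")
image(volcano, col = multi_hue,  main = "rainbow")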

Moderators
Jacqueline Meulman
Visiting Professor, Stanford University

Speakers
William K. Cornwell
UNSW, Australia
Into Ecology, Evolutionary Biology, and data visualization


Tuesday June 28, 2016 5:57pm - 6:15pm PDT
McCaw Hall
 
Wednesday, June 29
 

10:30am PDT

Notebooks with R Markdown
Notebook interfaces for data analysis have compelling advantages including the close association of code and output and the ability to intersperse narrative with computation. Notebooks are also an excellent tool for teaching and a convenient way to share analyses. As an authoring format, R Markdown bears many similarities to traditional notebooks like Jupyter and Beaker, but it has some important differences. R Markdown documents use a plain-text representation (markdown with embedded R code chunks) which creates a clean separation between source code and output, is editable with the same tools as for R scripts (.Rmd modes are available for Emacs, Vim, Sublime, Eclipse, and RStudio), and works well with version control. R Markdown also features a system of extensible output formats that enable reproducible creation of production-quality output in many formats including HTML, PDF, Word, ODT, HTML5 slides, Beamer, LaTeX-based journal articles, websites, dashboards, and even full length books. In this talk we'll describe a new notebook interface for R that works seamlessly with existing R Markdown documents and displays output inline within the standard RStudio .Rmd editing mode. Notebooks can be published using the traditional Knit to HTML or PDF workflow, and can also be shared with a compound file that includes both code and output, enabling readers to easily modify and re-execute the code. Building a notebook system on top of R Markdown carries forward its benefits (plain text, reproducible workflow, and production quality output) while enabling a richer, more literate workflow for data analysis.
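A minimal R Markdown source file makes the plain-text representation concrete; the title and output format below are placeholders:

---
title: "Analysis notebook"
output: html_document
---

Narrative in markdown, then an executable chunk:

```{r}
summary(cars)
```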

Moderators
Karthik Ram
Co-founder, rOpenSci
Karthik Ram is a co-founder of rOpenSci and a data science fellow at the University of California's Berkeley Institute for Data Science. Karthik primarily works on a project that develops R-based tools to facilitate open science and access to open data.

Speakers

Wednesday June 29, 2016 10:30am - 10:48am PDT
McCaw Hall

10:48am PDT

Visualizing Simultaneous Linear Equations, Geometric Vectors, and Least-Squares Regression with the matlib Package for R
The aim of the matlib package is pedagogical --- to help teach concepts in linear algebra, matrix algebra, and vector geometry that are useful in statistics. To this end, the package includes various functions for numerical linear algebra, most of which duplicate capabilities available elsewhere in R, but which are programmed transparently and purely in R code, including functions for solving possibly over- or under-determined linear simultaneous equations, for computing ordinary and generalized matrix inverses, and for producing various matrix decompositions. Many of these methods are implemented via Gaussian elimination. This paper focuses on the visualization facilities in the matlib package, including for graphing the solution of linear simultaneous equations in 2 and 3 dimensions; for demonstrating vector geometry in 2 and 3 dimensions; and for displaying the vector geometry of least-squares regression. We illustrate how these visualizations help to communicate fundamental ideas in linear algebra, vector geometry, and statistics. The 3D visualizations are implemented using the rgl package.
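A sketch of the equation-solving and visualization functions described above, on a small 2-D system (function names from the matlib package):

library(matlib)
A <- matrix(c(1, 2, 3, -1), 2, 2)
b <- c(2, 1)
showEqn(A, b)     # print the system A x = b as equations
plotEqn(A, b)     # graph the two lines and their intersection
Solve(A, b)       # solution via Gaussian elimination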

Moderators
Karthik Ram
Co-founder, rOpenSci

Speakers
John Fox
Professor, McMaster University


Wednesday June 29, 2016 10:48am - 11:06am PDT
McCaw Hall

11:06am PDT

On the emergence of R as a platform for emergency outbreak response
The recent Ebola virus disease outbreak in West Africa has been a terrible reminder of the necessity of rapid evaluation of, and response to, emerging infectious disease threats. For such a response to be fully informed, complex epidemiological data, including dates of symptom onset, locations of cases, hospitalisation, contact-tracing information and pathogen genome sequences, have to be analysed in near real time. Integrating all these data to inform the public health response is a challenging task, which typically involves a variety of visualisation tools and statistical approaches. However, a unified platform for outbreak analysis has been lacking so far. Some recent collaborative efforts, including several international hackathons, have been made to address this issue. This talk will provide an overview of the current state of R as a platform for the analysis of disease outbreaks, with an emphasis on lessons learnt from direct involvement in the recent Ebola outbreak response.
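Not tied to any single package, a generic base-R sketch of the first step in most outbreak analyses, the epidemic curve built from symptom-onset dates (data simulated for illustration):

set.seed(1)
onset <- as.Date("2014-08-04") + sort(rgeom(150, 0.02))  # simulated onset dates
weeks <- cut(onset, breaks = "week")
barplot(table(weeks), las = 2, ylab = "cases",
        main = "Weekly incidence (epicurve)")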

Moderators
Karthik Ram
Co-founder, rOpenSci

Speakers
Thibaut Jombart
Lecturer, Imperial College London
Biostatistician working on disease outbreak response, pathogen population genetics, phylogenetics, and some random other things; R developer of adegenet, adephylo, apex, bmmix, dibbler, geoGraph, outbreaker(2), treescape, vimes, and a few more shameful projects...


Wednesday June 29, 2016 11:06am - 11:24am PDT
McCaw Hall

11:24am PDT

Network Diffusion of Innovations in R: Introducing netdiffuseR
The Diffusion of Innovations theory, while one of the oldest social science theories, has ebbed and flowed in popularity over its roughly 100-year history. In contrast to contagion models, diffusion of innovations can be more complex, since adopting an innovation usually requires more than simple exposure to other users. At the same time, although computational tools for data collection, analysis, and network research have advanced considerably, there has been little parallel development of diffusion network models. To address this gap, we have created the netdiffuseR R package, which implements both classical and novel diffusion of innovations models, visualization methods, and data-management tools for the statistical analysis of network diffusion data. netdiffuseR goes further by allowing researchers to analyze relatively large datasets in a fast and reliable way, extending current network analysis methods for studying diffusion and thus serving as a great complement to other popular network analysis tools such as igraph, statnet or RSiena. netdiffuseR can be used with new empirical data, with simulated data, or with existing empirical diffusion network datasets.
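A hedged sketch of simulating and plotting a diffusion network with netdiffuseR; argument defaults are abbreviated and the sizes are illustrative:

library(netdiffuseR)
set.seed(8)
net <- rdiffnet(n = 200, t = 10)   # random graph plus threshold adoption model
summary(net)                       # adoption rates per time period
plot_diffnet(net)                  # adoption spreading over the network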

Moderators
Karthik Ram
Co-founder, rOpenSci

Speakers
George G. Vega Yon
University of Southern California


Wednesday June 29, 2016 11:24am - 11:42am PDT
McCaw Hall

11:42am PDT

Classifying Murderers in Imbalanced Data Using randomForest
In order to allocate resources more effectively, with the goal of providing safer communities, R's randomForest algorithm was used to identify candidates who may commit or attempt murder. While crime data within the general population may be highly imbalanced, one might expect the rate of murderers within a high-risk probationer population to be much less so. However, the County of Los Angeles had nearly 130 probationers commit or attempt murder out of nearly 17,000, a ratio close to 1:130. Classic methods were used to overcome class imbalance, including stratified under/over-sampling and variable sampling per tree. The results were encouraging: model validation tests demonstrate an 87% overall accuracy rate at relatively low cost. The agency's current risk assessment tool was outperformed by randomForest by up to 52% (both in overall accuracy and in the reduction of false positives). This work is based on research by Berk, R. et al. (2009), originally published in the Journal of the Royal Statistical Society.
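The class-imbalance levers used here are randomForest's strata and sampsize arguments, which draw a balanced stratified sample for each tree; the data below are simulated at roughly the 1:130 ratio described above:

library(randomForest)
set.seed(42)
n <- 17000
y <- factor(c(rep("murder", 130), rep("other", n - 130)))
x <- data.frame(risk = rnorm(n, mean = as.numeric(y == "murder")))
fit <- randomForest(x, y, strata = y,
                    sampsize = c(murder = 100, other = 100),  # balanced draw per tree
                    ntree = 500)
fit$confusion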

Moderators
Karthik Ram
Co-founder, rOpenSci

Speakers
Jorge Alberto Miranda
Analyst, County of Los Angeles
I have been an R user since 2013, when I first started working with a data reporting team at the Los Angeles County Probation Department. One of my goals in life is to convert more of my colleagues into R users and make R part of the County toolkit. With nearly 100,000 employees, I...


Wednesday June 29, 2016 11:42am - 12:00pm PDT
McCaw Hall

1:00pm PDT

R markdown: Lifesaver or death trap?
The popularity of R markdown is unquestionable, but will it prove as useful to the blind community as it is for our sighted peers? The short answer is "yes", but the more realistic answer is that it depends on many other aspects, some of which will remain outside the skill sets of many authors. Source R markdown files are plain text files, and are therefore totally accessible to a blind user. The documents generated from these source files differ in their accessibility: HTML is great, while a pdf generated using LaTeX is very limited. International standards exist for ensuring most document formats are accessible, but the TeX community has not yet developed a tool for generating an accessible pdf from any form of LaTeX source, and there is little hope for any pdf containing mathematical expressions or graphical content. In contrast, the HTML documents created from R markdown can incorporate many aspects of accessibility with little or no additional work required from a document's author. A substantial problem facing any blind author wishing to create an HTML document from their R markdown files is that there is no simple accessible editor: RStudio is not an option that can be used by blind people. Until such time as an alternative tool becomes available, blind people will either have to use cumbersome work-arounds or rely on a small application we have built specifically for editing and processing R markdown documents.

Moderators
Rasmus Arnling Bååth
Data Scientist, King
I'm a data scientist at King interested in all things stats, but if it's Bayesian I'm especially interested.

Speakers
A. Jonathan R. Godfrey
Institute of Fundamental Sciences, Massey University


Wednesday June 29, 2016 1:00pm - 1:18pm PDT
McCaw Hall

1:18pm PDT

A Lap Around R Tools for Visual Studio
R Tools for Visual Studio is a new, open-source and free tool for R users, built on top of the powerful Visual Studio IDE. In this talk, we will take you on a tour of its features and show you how they can help you be a more productive R user. We will look at:
- Integrated debugging support
- Variable/data frame visualization
- Plotting and help integration
- Using the editor and REPL in concert with each other
- R Markdown and Shiny integration
- Using Excel and SQL Server
- Extensions and source control

Moderators
Rasmus Arnling Bååth
Data Scientist, King

Speakers
John Lam
Microsoft


Wednesday June 29, 2016 1:18pm - 1:36pm PDT
McCaw Hall

1:36pm PDT

Adding R, Jupyter and Spark to the toolset for understanding the complex computing systems at CERN's Large Hadron Collider
High Energy Physics (HEP) has a decades-long tradition of statistical data analysis and of using large computing infrastructures. CERN's current flagship project, the LHC, has collected over 100 PB of data, which is analysed in a world-wide distributed computing grid by millions of jobs daily. Being a community of several thousand scientists, HEP also has a tradition of developing its own analysis toolset. In this contribution we will briefly outline the core physics analysis tasks and then focus on applying data analysis methods to understand and optimise the large, distributed computing systems in the CERN computer centre and the world-wide LHC computing grid. We will describe the approach and tools picked for the analysis of metrics about job performance, disk and network I/O, and the geographical distribution of and access to physics data. We will present the technical and non-technical challenges in optimising a distributed infrastructure for large-scale science projects, and will summarise the first results obtained.

Moderators
Rasmus Arnling Bååth
Data Scientist, King

Speakers
Dirk Duellmann
Analysis & Design - Storage Group, CERN
Quantitative understanding of large computing and storage systems


Wednesday June 29, 2016 1:36pm - 1:54pm PDT
McCaw Hall

1:54pm PDT

How to do one's taxes with R
In this talk it is shown how to generate a tax return (German VAT) with R and send it over the internet to the tax administration. As this is certainly not a standard application for R (special software exists for this purpose), it may be worthwhile to take a closer look at the techniques used to realize this kind of transaction, and to draw out the analogies to distributed data analysis. If confidential data cannot be analysed in the environment where it is created or stored, it has to be transferred over the internet to some kind of execution service, e.g. a cluster system. Encryption is necessary to protect the data, as is appending a digital signature to guarantee ownership and prevent modification. Additionally, some kind of packaging has to be applied to the data, together with metadata giving directions for the receiver to handle the delivery. When the result is returned, the same techniques are used, so again privacy and authorship are ensured. For the tax example, all these procedures have to observe well-established cryptographic standards for encryption, hashing and digital signatures, which change from time to time according to new results in cryptographic research. I demonstrate an implementation in R of this kind of transaction in a data science context, trying to use the same rigorous standards mentioned above whenever possible. This leads to an overview of existing R packages and external software useful and necessary to realize a corresponding program. Finally, some proposals for a possible standardization of a secure distributed data analysis scenario are presented.
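A sketch of the cryptographic building blocks mentioned above using the openssl package; the payload is a placeholder, and the actual tax transmission of course requires the official protocol rather than these generic primitives:

library(openssl)
key <- rsa_keygen(2048)                             # sender's key pair
msg <- charToRaw("VAT return payload")              # placeholder payload
sig <- signature_create(msg, sha256, key = key)     # digital signature
signature_verify(msg, sig, sha256, pubkey = key$pubkey)
enc <- encrypt_envelope(msg, key$pubkey)            # hybrid encryption for transport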

Moderators
Rasmus Arnling Bååth
Data Scientist, King

Speakers
Benno Süselbeck
University of Muenster, Center for Information Processing


Wednesday June 29, 2016 1:54pm - 2:12pm PDT
McCaw Hall

2:12pm PDT

Using R in a regulatory environment: FDA experiences
The Food and Drug Administration (FDA) regulates products which account for approximately one fourth of consumer spending in the United States of America, and has global impact, particularly for medical products. This talk will discuss the Statistical Software Clarifying Statement (http://www.fda.gov/ForIndustry/DataStandards/StudyDataStandards/ucm445917.htm), which corrects the misconception that FDA requires the use of proprietary software for FDA submissions. Next, we will describe several use cases for R at FDA, including review work, research, and collaborations with industry, academe and other government agencies. We describe advantages, challenges and opportunities of using R in a regulatory setting. Finally, we close with a brief demonstration of a Shiny openFDA application for the FDA Adverse Event Reporting System (FAERS) available at https://openfda.shinyapps.io/LRTest/.

Moderators
Rasmus Arnling Bååth
Data Scientist, King

Speakers
Paul H Schuette
Mathematical Statistician, FDA
Regulatory Science, FDA


Wednesday June 29, 2016 2:12pm - 2:30pm PDT
McCaw Hall
 
Thursday, June 30
 

10:30am PDT

Shiny Gadgets: Interactive tools for Programming and Data Analysis
A Shiny Gadget is an interactive tool that enhances your R programming experience. You make Shiny Gadgets with the same package that you use to make Shiny Apps, but you use Gadgets in a very different way. Where Shiny Apps are designed to communicate results to an end user, Gadgets are designed to generate results for an R user: each Shiny Gadget returns a value that you can immediately use in your code. You use Shiny Gadgets during the course of your analysis to handle iterative tasks quickly and interactively. For example, you might use a Shiny Gadget to preview the matches generated by a regular expression, as you write the expression. Or you might use a Shiny Gadget to identify high-leverage points in your model, as you fit the model. Unlike Shiny Apps, Shiny Gadgets do not need to be deployed on a server. Shiny Gadgets are defined right inside of a regular R function. This is important, because it means that Gadgets can directly access the function's arguments, and the return value of the Gadget can be the return value for the function. Despite this difference, almost everything you know about Shiny Apps will transfer over to writing Shiny Gadgets. Ready to see what Gadgets are all about? Attend this talk for some inspiring examples. The talk will also introduce the miniUI package, a collection of layout elements that are well suited to Shiny Gadgets.
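A complete toy Gadget illustrating the pattern described above: brush points on a plot, press Done, and the selection comes back as an ordinary return value (miniUI layout; stopApp() delivers the result to the caller):

library(shiny)
library(miniUI)
pick_points <- function(data) {
  ui <- miniPage(
    gadgetTitleBar("Select points"),
    miniContentPanel(plotOutput("plot", brush = "brush"))
  )
  server <- function(input, output, session) {
    output$plot <- renderPlot(plot(data$x, data$y))
    observeEvent(input$done, {
      stopApp(brushedPoints(data, input$brush, xvar = "x", yvar = "y"))
    })
  }
  runGadget(ui, server)
}
# selected <- pick_points(data.frame(x = rnorm(50), y = rnorm(50)))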

Moderators
Julie Josse
INRIA/Agrocampus

Speakers
Garrett Grolemund
Educator, RStudio


Thursday June 30, 2016 10:30am - 10:48am PDT
McCaw Hall

10:48am PDT

Authoring Books with R Markdown
Markdown is a simple and popular language for writing. R Markdown (http://rmarkdown.rstudio.com) has made it really easy to author documents that contain R code, and convert these documents to a variety of output formats, including PDF, HTML, Word, and presentations. There are still some missing pieces in the toolchain, especially when writing long-form articles and books, such as cross-references, automatic numbering of figures/tables, multiple-page HTML output and a navigation system, and so on. The R package bookdown has solved all these problems for several types of output formats, such as HTML, PDF, EPUB and MOBI e-books. The visual style of the book is customizable. When the output format is HTML, the book may contain interactive components, such as HTML widgets and Shiny apps, so readers may interact with certain examples in the book in real time (screenshots of these examples will be automatically taken and used when the output format is non-HTML). In this talk, we will give a quick tour through the bookdown package, and show how to quickly get started with writing a book. We will also talk about various options for editing, hosting, and publishing a book. Our goal is that authors can focus as much as possible on the content of the book, instead of spending too much time on any complicated non-portable syntax of authoring languages, or tools for converting books to different output formats. In other words, "one ring to rule them all."
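A minimal sketch of getting started: a book is a directory of .Rmd files whose first file, index.Rmd, declares the site; the field values below are illustrative:

---
title: "My Book"
site: bookdown::bookdown_site
output: bookdown::gitbook
---

# Introduction
Cross-references and figure/table numbering work out of the box.

Then build or preview from R:

bookdown::render_book("index.Rmd")   # build the whole book
bookdown::serve_book()               # live preview while writing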

Moderators
Julie Josse
INRIA/Agrocampus

Speakers
Hadley Wickham
Chief Scientist, RStudio
Hadley is Chief Scientist at RStudio, winner of the 2019 COPSS award, and a member of the R Foundation. He builds tools (both computational and cognitive) to make data science easier, faster, and more fun. His work includes packages for data science (like the tidyverse, which includes...

Yihui Xie
Software Engineer, RStudio, PBC
Yihui Xie is a software engineer at RStudio. He earned his PhD from the Department of Statistics, Iowa State University. He has authored and co-authored several R packages, such as knitr, rmarkdown, bookdown, blogdown, and xaringan. He has published a number of books, including "Dynamic...


Thursday June 30, 2016 10:48am - 11:06am PDT
McCaw Hall

11:06am PDT

Estimation of causal effects in network-dependent data
We describe two R packages which facilitate causal inference research in network-dependent data: the simcausal package for conducting network-based simulation studies, and the tmlenet package for the estimation of various causal effects in simcausal-simulated or real-world network datasets. In addition to the estimation of various causal effects, the tmlenet package implements several approaches to estimating standard errors for dependent (non-IID) data with known network structure. Both packages implement a new syntax that repurposes the list indexing operator '[[...]]' for specifying complex network-based data summaries of the observed covariates. For example, sum(A[[1:Kmax]]) specifies a network-driven summary, evaluated for each unit i as the sum of the variable A's values over all "friends" of i. This new syntax is fully generalizable to any type of user-defined function and any type of network. The practical applicability of both packages is illustrated with a large-scale simulation study of a hypothetical highly-connected community with an intervention that aimed to increase the level of physical activity by (i) educating a simulated study population of connected subjects, and/or (ii) intervening on the network structure itself. We will describe how our work can be extended to complex network processes that evolve over time, and discuss possible avenues for future research on estimation of causal effects in longitudinal network settings.
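A hedged sketch of the '[[...]]' network-summary syntax in a simcausal data-generating specification; the network generator gen.net and its signature are assumptions for illustration, not the package's own example:

library(simcausal)
# Assumed generator signature: returns an n-by-Kmax matrix of friend ids.
gen.net <- function(n, Kmax = 3, ...) t(replicate(n, sample.int(n, Kmax)))
D <- DAG.empty() +
  network("Net", netfun = "gen.net") +                     # user-supplied network
  node("A", distr = "rbern", prob = 0.3) +
  node("E", distr = "rconst", const = sum(A[[1:Kmax]])) +  # sum of friends' A values
  node("Y", distr = "rbern", prob = plogis(-2 + 0.5 * E))
dat <- sim(set.DAG(D), n = 500)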

Moderators
Julie Josse
INRIA/Agrocampus

Speakers
Oleg Sofrygin
University of California, Berkeley


Thursday June 30, 2016 11:06am - 11:24am PDT
McCaw Hall

11:24am PDT

DataSHIELD: Taking the analysis to the data
Irrespective of discipline, data access and analysis barriers result from a range of scenarios:
* ethical-legal restrictions surrounding confidentiality and the sharing of, or access to, disclosive data;
* intellectual property or licensing issues surrounding research access to raw data;
* the physical size of the data being a limiting factor.
DataSHIELD (www.datashield.ac.uk) was born of the requirement in the biomedical and social sciences to co-analyse individual patient data from different sources without disclosing sensitive information. DataSHIELD comprises a series of R packages enabling the researcher to perform distributed analysis on individual-level data whilst satisfying the strict ethical-legal-governance restrictions related to sharing this data type. Furthermore, under the DataSHIELD infrastructure, set up as a client-server model, raw data never leaves the data provider (the server) and no individual-level data can be seen by the researcher (the client). Base functionality in the DataSHIELD R packages includes descriptive statistics (e.g. mean), exploratory statistics (e.g. histogram), contingency tables (1- and 2-dimensional frequency tables) and modelling (survival analysis using piecewise exponential regression, glm). The modular nature of DataSHIELD has allowed the scoping of additional data types, expanding DataSHIELD functionality to genomic, text and geospatial data. Different infrastructure models are also possible, tailored for pooled co-analysis, single-site analysis and linked data analysis. DataSHIELD has been successfully piloted in two European biomedical studies, sharing data across 14 different biobanks to investigate healthy obesity and the effect of environmental determinants on health. It is of proven value in the biomedical and social science domains, but has potential utility wider than this.
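A hedged sketch of a client-side DataSHIELD session, with function names from the opal and dsBaseClient packages; server URLs, credentials, and table names are placeholders:

library(opal)
library(dsBaseClient)
logindata <- data.frame(server   = c("study1", "study2"),
                        url      = c("https://study1.example", "https://study2.example"),
                        user     = "researcher", password = "****",
                        table    = "project.cohort")
conns <- datashield.login(logins = logindata, assign = TRUE)  # raw data stays remote
ds.mean("D$age", datasources = conns)        # only non-disclosive summaries return
ds.histogram("D$bmi", datasources = conns)
datashield.logout(conns)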

Moderators
Julie Josse
INRIA/Agrocampus

Speakers
Becca Wilson
Research Fellow, University of Bristol
Medical data sharing; distributed computing; privacy-protected data analysis.


Thursday June 30, 2016 11:24am - 11:42am PDT
McCaw Hall

11:42am PDT

Most Likely Transformations
The "mlt" package implements maximum likelihood estimation in the class of conditional transformation models. Based on a suitable explicit parameterisation of the unconditional or conditional transformation function using infrastructure from package "basefun", we show how one can define, estimate and compare a cascade of increasingly complex transformation models in the maximum likelihood framework. Models for the unconditional or conditional distribution function of any univariate response variable are set-up and estimated in the same computational framework simply by choosing an appropriate transformation function and parameterisation thereof. As it is computationally cheap to evaluate the distribution function, models can be estimated by maximisation of the exact likelihood, especially in the presence of random censoring or truncation. The relatively dense high-level implementation in the "R" system for statistical computing allows generalisation of many established implementations of linear transformation models, such as the Cox model or other parametric models for the analysis of survival or ordered categorical data, to the more complex situations illustrated in this paper.

Moderators
Julie Josse
INRIA/Agrocampus

Speakers
Torsten Hothorn
University of Zurich


Thursday June 30, 2016 11:42am - 12:00pm PDT
McCaw Hall
 