This is an intermediate/advanced level tutorial on dynamic documents with R Markdown. It starts with the basic idea of literate programming and its role in reproducible research. Among all document formats that knitr supports, we will focus only on R Markdown (.Rmd). We will give an overview of existing output formats in rmarkdown and explain how to customize them. We will show how to build new output format functions by extending existing formats. The packages tufte and bookdown will be used as examples. We will mention other applications related to R Markdown, such as HTML widgets [Vaidyanathan et al., 2015], Shiny documents [Chang et al., 2015], and running code from other languages (C, C++, and so on).
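As a hedged illustration of extending an existing format (not taken from the tutorial materials; the file name is hypothetical), an output format in rmarkdown is simply a function whose result can be passed to render():

# Minimal sketch: render an .Rmd file with a built-in format, then with
# a format that extends it (tufte::tufte_html builds on html_document).
library(rmarkdown)
render("example.Rmd", output_format = html_document(toc = TRUE))
render("example.Rmd", output_format = tufte::tufte_html())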
For complex traits such as cardiometabolic disease, we increasingly recognize that the intergenic space between protein coding genes (PCGs) contains highly ordered regulatory elements that control the expression and function of PCGs and can themselves be actively transcribed. Indeed, over 50% of genome-wide association studies (GWAS) of complex traits identify single nucleotide polymorphisms (SNPs) that fall in intergenic regions, and it is only recently becoming apparent that these regions are highly organized to perform specific functions. A next step in advancing precision medicine is careful and rigorous interrogation of the role of these regulatory elements, and their interplay with known PCGs and environmental factors, in the heritability of complex disease phenotypes. This tutorial focuses on analytic techniques and R tools designed to uncover these complex and largely uncharacterized relationships.
The goal of this tutorial is to provide participants with a deep understanding of four widely used algorithms in machine learning: the Generalized Linear Model (GLM), the Gradient Boosting Machine (GBM), Random Forest, and Deep Neural Nets. This includes a deep dive into the algorithms in the abstract sense and a review of the implementations of these algorithms available within the R ecosystem.
Due to their popularity, each of these algorithms has several implementations available in R. Each package author takes a unique approach to implementing the algorithm, and each package provides an overlapping, but not identical, set of model parameters available to the user. The tutorial will provide an in-depth analysis of how each of these algorithms is implemented in a handful of R packages.
After completing this tutorial, participants will have an understanding of how each of these algorithms works, and knowledge of the available R implementations and how they differ. The participants will understand, for example, why the xgboost package has, in less than a year, become one of the most popular GBM packages in R, even though the gbm R package has been around for years and has been widely used -- what are the implementation tricks used in xgboost that are not (yet) used in the gbm package? Or, why do some practitioners in certain domains prefer one implementation over another? We will answer these questions and more!
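As a hedged, illustrative sketch (not tutorial material), the same boosted classifier can be fit with both gbm and xgboost, which already shows how their interfaces and parameter names differ:

# Illustrative only: a boosted classifier on a toy data set with gbm and
# xgboost; note the different interfaces and parameter names.
library(gbm)
library(xgboost)
fit_gbm <- gbm(am ~ ., data = mtcars, distribution = "bernoulli",
               n.trees = 100, interaction.depth = 3, shrinkage = 0.1)
x <- as.matrix(mtcars[, setdiff(names(mtcars), "am")])
fit_xgb <- xgboost(data = x, label = mtcars$am, nrounds = 100,
                   max_depth = 3, eta = 0.1,
                   objective = "binary:logistic", verbose = 0)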
The tutorial will introduce different types of statistical methods for the analysis of survey data to produce estimates for small domains (sometimes termed ‘small areas’). This will include design-based estimators, which are based only on the study design and observed data, and model-based estimators, which rely on an underlying model to provide estimates. The tutorial will cover frequentist and Bayesian inference for Small Area Estimation. All methods will be accompanied by several examples that attendees will be able to reproduce.
This tutorial will be roughly based on the tutorial presented at useR! 2008 but will include updated materials. In particular, it will cover new R packages that have appeared since then.
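As a hedged sketch of the two families of estimators (object and variable names are hypothetical, not drawn from the tutorial), a direct design-based estimate per area can be computed with the survey package and an area-level model-based estimate with the sae package:

# Hypothetical names throughout: 'samp' is a survey sample with a cluster
# id, weights, an outcome, and an area label; 'unemployment' is an
# area-level covariate aligned with the rows of 'direct'.
library(survey)
des <- svydesign(ids = ~cluster, weights = ~w, data = samp)
direct <- svyby(~income, ~area, des, svymean)        # design-based direct estimates
library(sae)
fh <- eblupFH(direct$income ~ unemployment, vardir = direct$se^2)  # Fay-Herriot EBLUP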
In the realm of marketing analytics, time-to-event modeling at the customer level can provide a more granular view of the incremental impact that marketing campaigns have on individuals. Addressable media can be mapped to an individual, and even aggregated data can be mapped down to an individual via various techniques (e.g., geographic or DMA-level matching). To accurately assess the incremental effect of marketing, a primary task during modeling is not only to estimate the magnitude of the marketing effect, but also to capture the differing decay rate of each specific campaign.
This tutorial will describe the basics of applying time-to-event statistical modeling to marketing analytics problems. Beginning with data preparation, sampling, outlier detection, and techniques to control for non-marketing effects, the tutorial will move on to consider various modeling strategies and methods for evaluating model effectiveness. The techniques and processes presented will mimic a typical marketing analytics workflow. We will be using a random sample of data from a large (anonymized) retail firm.
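As a hedged illustration of the modeling style described above (data frame and variable names are hypothetical), a proportional hazards model from the survival package can relate marketing exposures to time until purchase, with censoring for customers who never convert:

# Hypothetical data: one row per customer, days from first exposure to
# purchase (or to end of observation), a conversion indicator, and
# marketing-touch covariates.
library(survival)
fit <- coxph(Surv(days_to_purchase, purchased) ~ email_touches +
               display_impressions + season, data = cust)
summary(fit)  # hazard ratios summarize incremental lift per marketing touch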
Data analysts can use the Git version control system to manage a motley assortment of project files in a sane way (e.g., data, code, reports, etc.). This has benefits for the solo analyst and, especially, for anyone who wants to communicate and collaborate with others. Git helps you organize your project over time and across different people and computers. Hosting services like GitHub, Bitbucket, and GitLab provide a home for your Git-based projects on the internet.
What's special about using R and Git(Hub)?
The Stan project implements a probabilistic programming language, a library of mathematical and statistical functions, and a variety of algorithms to estimate statistical models in order to make Bayesian inferences from data. The three main sections of this tutorial will
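As a hedged illustration of the workflow Stan supports from R (not drawn from the tutorial materials), a small model can be compiled and fit through the rstan interface:

# Minimal rstan sketch: estimate the probability of success in a
# Bernoulli model from a handful of binary observations.
library(rstan)
model_code <- "
data { int<lower=0> N; int<lower=0,upper=1> y[N]; }
parameters { real<lower=0,upper=1> theta; }
model { theta ~ beta(1, 1); y ~ bernoulli(theta); }
"
fit <- stan(model_code = model_code,
            data = list(N = 10, y = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 1)))
print(fit)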
data.table is known for its speed on large data in RAM (e.g. 100GB), but it also has a consistent and flexible syntax for more advanced data manipulation tasks on small data. First released to CRAN in 2006, it continues to grow in popularity. 180 CRAN and Bioconductor packages now import or depend on data.table. Its StackOverflow tag has attracted 4,000 questions from users in many fields, making it one of the three most-asked-about R packages. It is the 7th most starred R package on GitHub.
This three hour tutorial will guide complete beginners from basic queries through to advanced topics via examples you will run on your laptop. There is a short learning curve to data.table but once it clicks it sticks.
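As a hedged taste of the syntax covered (toy data, not tutorial material), most queries follow the general form DT[i, j, by]:

# Toy example of the DT[i, j, by] form.
library(data.table)
DT <- data.table(id = rep(c("a", "b"), each = 5), x = 1:10, y = rnorm(10))
DT[x > 3]                           # i: filter rows
DT[, .(mean_y = mean(y)), by = id]  # j with by: aggregate within groups
DT[, cum_x := cumsum(x), by = id]   # := adds a column by reference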
The art of data analysis concerns using flexible statistical models, choosing tools wisely, avoiding overfitting, estimating quantities of interest, making statistical inferences and predictions, validating predictive accuracy, graphical presentation of complex models, and many other important techniques. Regression models can be extended in a number of ways to meet many of the modern challenges in data analysis. Software that makes it easier to incorporate modern statistical methods and good statistical practice removes obstacles and leads to greater insights from data. The presenter has striven to bring modern regression, missing data imputation, data reduction, and bootstrap model validation techniques into everyday practice by writing Regression Modeling Strategies (Springer, 2015, 2nd edition) and by writing an R package, rms, that accompanies the book. Detailed information may be found at http://biostat.mc.vanderbilt.edu/rms.
The tutorial will cover two chapters in Regression Modeling Strategies related to general aspects of multivariable regression, relaxing linearity assumptions using restricted cubic splines, multivariable modeling strategy, and a brief introduction to bootstrap model validation. The rms package will be introduced, and at least two detailed case studies using the package will be presented. The methods covered will apply to almost any regression model, including ordinary least squares, logistic regression models, ordinal regression, quantile regression, longitudinal data analysis, and survival models.
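As a hedged sketch of the kind of rms workflow the case studies involve (simulated data; settings are illustrative only):

# Simulated data: a logistic model with a restricted cubic spline,
# followed by bootstrap validation of predictive performance.
library(rms)
set.seed(1)
d <- data.frame(x = rnorm(200), g = factor(sample(c("a", "b"), 200, replace = TRUE)))
d$y <- rbinom(200, 1, plogis(0.5 * d$x))
dd <- datadist(d); options(datadist = "dd")
fit <- lrm(y ~ rcs(x, 4) + g, data = d, x = TRUE, y = TRUE)
validate(fit, B = 200)   # bootstrap estimate of optimism in model performance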
An interactive graphic invites the viewer to become an active partner in the analysis and allows for immediate feedback on how the data and results may change when inputs are modified. Interactive graphics can be extremely useful for exploratory data analysis, for teaching, and for reporting.
Because there are so many different kinds of interactive graphics, there has been an explosion in R packages that can produce them (e.g. animint, shiny, rCharts, rMaps, ggvis, htmlwidgets). A beginner with little knowledge of interactive graphics can thus easily struggle with (1) understanding what kinds of graphics are useful for what kinds of data, and (2) finding an R package that can produce the desired type of graphic. This tutorial addresses these two problems by (1) introducing a vocabulary of keywords for understanding the different kinds of graphics, and (2) explaining which R packages can be used for each kind of graphic.
Attendees will gain hands-on experience with using R to create interactive graphics. We will discuss several example data sets and several R interactive graphics packages. Attendees will learn a vocabulary that helps to understand the strengths and weaknesses of the many different packages which are currently available.
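As a hedged, minimal example of the input-to-output feedback loop these packages provide (using shiny, one of the packages listed above):

# A plot that reacts to a slider: the core idea behind interactive graphics.
library(shiny)
shinyApp(
  ui = fluidPage(sliderInput("n", "Sample size", min = 10, max = 1000, value = 100),
                 plotOutput("hist")),
  server = function(input, output) {
    output$hist <- renderPlot(hist(rnorm(input$n)))
  }
)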
This tutorial introduces the Jupyter notebook project (previously called IPython notebooks). The tutorial will describe how Jupyter can be used for:
Bell Labs in the 1970s was a hotbed of research in computing, statistics and many other fields. The conditions there encouraged the growth of the S language and influenced its content. The 40th anniversary of S is an appropriate time to relate a personal view of that scene and reflect on why S (and R) turned out as it did.
Global Climate Models (GCMs) can be used to assess the impacts of future climate change on particular regions of interest, municipalities, or pieces of infrastructure. However, the coarse spatial scale of GCM grids (50km or more) can be problematic for engineers or others interested in more localized conditions. This is particularly true in areas with high topographic relief and/or substantial climate heterogeneity. A technique in climate statistics known as "downscaling" exists to map coarse-scale climate quantities to finer scales.
Potential climate downscalers face two main challenges: few open-source implementations of proper downscaling methods exist, and many downscaling methods necessarily require information across both time and space, which makes them complex to implement and demanding of substantial computational resources to execute.
The Pacific Climate Impacts Consortium in Victoria, BC, Canada has written a high-performance implementation of the quantile delta mapping (QDM) method for downscaling climate variables. QDM is a proven climate downscaling technique that has been shown to preserve the relative changes in both the climate means and the extremes. We will release this software, named "ClimDown", to CRAN under an open source license. Using ClimDown, researchers can downscale spatially coarse global climate models to an arbitrarily fine resolution for which gridded observations exist.
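To illustrate the idea behind QDM-style bias correction (a hedged, simplified base-R sketch, not the ClimDown implementation):

# Additive quantile delta mapping, simplified: each projected value keeps
# its change relative to the historical model run at the same quantile,
# and that change is applied to the observed distribution.
qdm_additive <- function(obs_hist, mod_hist, mod_proj) {
  tau <- ecdf(mod_proj)(mod_proj)                             # quantile of each projected value
  delta <- mod_proj - quantile(mod_hist, tau, names = FALSE)  # model-projected change
  quantile(obs_hist, tau, names = FALSE) + delta              # apply change to observations
}
set.seed(1)
downscaled <- qdm_additive(obs_hist = rnorm(1000, 10, 3),
                           mod_hist = rnorm(1000, 12, 3),
                           mod_proj = rnorm(1000, 14, 3))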
Our proof-of-concept for ClimDown has been to downscale all of Canada at 10km resolution for models and scenarios from the IPCC's Coupled Model Intercomparison Project. We will present our results and performance metrics from this exercise.